Re: [pcre-dev] Unexpected result when zero-length global mat…

Author: ph10
Date:  
To: ND
CC: Pcre-dev
Subject: Re: [pcre-dev] Unexpected result when zero-length global matching and pattern contains \G
On Thu, 31 May 2018, ND via Pcre-dev wrote:

> PCRE2 version 10.31 2018-02-12
> /(?<=\G.)/g,replace=-
> abc
> 2: a-bc-
>
>
> Logically expected result: a-b-c-
>
>
> PCRE advances by one character between zero-length matches. But it seems it
> should not in this case.


It is *pcre2test* that is doing the advancing, not the PCRE2 library,
which only ever does one match at a time.

The use of \G in lookbehinds, like \K, can cause a lot of confusion.

\G is true when the current matching point is at the start of the
subject plus the starting offset (compare ^, which is only true at the
start of the subject in single-line mode). Consider the match without
the replace, and some added parentheses:

/(?<=\G(.))/g
abc
0:
1: a
0:
1: c

The /g operation in pcre2test starts by calling pcre2_match() with a
starting offset of 0. This defines where \G will match. However, the
lookbehind fails early, because there are no earlier characters, so the
match moves on (within the PCRE2 library) to try again from a new
starting position, but this does NOT change the original starting
offset.

The attempt at the new position succeeds: looking behind from position 1
reaches offset 0, where \G is true, and the dot matches "a". Normally,
the next call to pcre2_match() from pcre2test would pass the subject
with a new starting offset, set to the end of the previous match.
However, the match was for an empty string, so pcre2test moves on one
character, and the starting offset is now 2. This time the lookbehind
can step back, but when it does, the matching point is not at the
starting offset, so once again the match moves on within the PCRE2
library, finding a match one character later.

In both cases, it finds a match one character later than the starting
offset.
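
The sequence above can be traced with a small model. This is only a
sketch in Python, not PCRE2 code: match_once() and global_matches() are
hypothetical names, the pattern (?<=\G(.)) is hard-coded rather than
compiled, and the /g loop is simplified to "after an empty match, move
the starting offset on one character", as described above.

```python
# Hypothetical model of one pcre2_match() call for the pattern (?<=\G(.)).
# It is NOT the PCRE2 API; it only mimics the behaviour described above.
def match_once(subject, start_offset):
    # The library tries each position from the starting offset onwards,
    # but the starting offset itself (where \G is true) never changes.
    for pos in range(start_offset, len(subject) + 1):
        back = pos - 1  # the lookbehind steps back one character
        if back != start_offset:
            continue  # \G is false here (or there is nothing to look behind at)
        if subject[back] == "\n":
            continue  # by default the dot does not match a newline
        return pos, subject[back]  # empty match at pos, group 1 captured
    return None


# Simplified model of the /g loop: every match of this pattern is empty,
# so the next starting offset is the end of the match plus one character.
def global_matches(subject):
    results = []
    offset = 0
    while offset <= len(subject):
        found = match_once(subject, offset)
        if found is None:
            break
        end, captured = found
        results.append((offset, end, captured))
        if end >= len(subject):
            break  # nothing left to scan
        offset = end + 1
    return results


# Insert the replacement text at each (empty) match position.
def replace_all(subject, replacement="-"):
    out, last = [], 0
    for _, end, _ in global_matches(subject):
        out.append(subject[last:end])
        out.append(replacement)
        last = end
    out.append(subject[last:])
    return "".join(out)


if __name__ == "__main__":
    for start, end, captured in global_matches("abc"):
        print(f"start offset {start}: empty match at {end}, group 1 = {captured!r}")
    print(replace_all("abc"))
```

Under these assumptions the model reports matches at positions 1 and 3
(starting offsets 0 and 2), capturing "a" and "c", and the replacement
yields a-bc-, which agrees with the pcre2test output quoted above.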

Philip

--
Philip Hazel