Author: ph10
Date:  
To: ND
CC: Pcre-dev
Subject: Re: [pcre-dev] Unexpected result when zero-length global matching and pattern contains \G
On Sat, 30 Jun 2018, ND via Pcre-dev wrote:

> I can imagine that starting offset need to move by one character if
>   there is an empty match
>      AND
>   startoffset argument of this empty match is equal to ovector[0] (
> So  startoffset == ovector[0] == ovector[1]

>
> If ovector[0]==ovector[1] but ovector[0]>startoffset then there was a move
> along subject string during the matching, and we don't need another move by
> one character.


This doesn't work. I tried it, with some debugging to show the values.
It worked in the simple test case, but in the main tests this test and
some others failed:

/^/gm
    a\nb\nc\n
 0: 
~~start=0 end=0 offset=0
 0: 
~~start=2 end=2 offset=1
** PCRE2 error: global repeat returned the same string as previous
** Global loop abandoned


For an unanchored match, different values of startoffset can still give
the same match, which is what we don't want.
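The failure above can be reproduced outside pcre2test with a rough simulation. This is only a sketch under stated assumptions: it uses Python's re module as a stand-in for pcre2_match(), the function name global_match_proposed is made up for illustration, and the loop is not the actual pcre2test code. It applies the quoted rule (advance by one character only when startoffset == ovector[0] == ovector[1]) and stops when the same match comes back twice.

```python
import re


def global_match_proposed(pattern, subject):
    """Simulate a /g loop using the quoted rule (hypothetical helper).

    Returns (results, repeated): results holds one
    (match_start, match_end, startoffset) tuple per match, and repeated
    is True if the loop stopped because the same match came back twice.
    """
    results = []
    previous = None
    startoffset = 0
    while startoffset <= len(subject):
        # Unanchored search, standing in for pcre2_match() with a startoffset.
        m = pattern.search(subject, startoffset)
        if m is None:
            break
        start, end = m.span()
        if (start, end) == previous:
            # Same match as last time: the condition pcre2test reports.
            return results, True
        results.append((start, end, startoffset))
        previous = (start, end)
        if start == end == startoffset:
            # Empty match exactly at startoffset: move on by one character.
            startoffset = end + 1
        else:
            # Otherwise continue from the end of the match, with no bump.
            startoffset = end
    return results, False


results, repeated = global_match_proposed(
    re.compile(r"^", re.MULTILINE), "a\nb\nc\n"
)
print(results, repeated)
```

With startoffset=1 the search finds the empty match at offset 2, and because 2 > 1 the rule does not bump along, so the next search starts at 2 and finds the same empty match again. That is the start=2 end=2 offset=1 line followed by the "same string as previous" error in the output above.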

What is needed is not startoffset, which gives the offset that is first
tried, but the actual start-of-match offset that gives the match result.
This is not currently passed back from pcre2_match() or
pcre2_dfa_match() or pcre2_jit_match().

> Philipp, the aim of my post was not to get a help on how this result happens
> inside PCRE2 now. But to note that PCRE2 result in this situation is not
> correspond with neither logical, nor Perl result.


Note that this effect is in pcre2test, not in the basic library. /g is
implemented entirely in pcre2test, so the basic library does not know it
is handling a repeat match.

I will see if I can think of an easy way of making pcre2test do better,
without having to do a lot of work on pcre2_match() and pcre2_dfa_match()
and pcre2_jit_match(), but I do not regard this as very important
because using things like \G in lookbehind assertions is rare and
also quite eccentric.

Philip

--
Philip Hazel