Re: [pcre-dev] Max_lookbehind issues

Author: ph10
Date:  
To: ND
CC: pcre-dev
Subject: Re: [pcre-dev] Max_lookbehind issues
On Sat, 10 Aug 2019, ND via Pcre-dev wrote:

> I would appreciate it if we first reached a consensus on these suggestions
> before you make any rewrite of the partial matching code.


I do not intend to make any more changes to the partial matching code. I
may do further updates to the documentation.

> As a lookahead is an independent pattern, we can calculate its own
> maxlookbehind and add it to the outer lookbehind's own maxlookbehind. We don't
> care much whether that "total" lookbehind is exactly the minimum optimal value.
> We really must care that it is not less than this value.


If you don't care about optimal values, why bother with lookbehind at
all? Just retain the entire first segment after a partial match.

> > >/(?<=\A.)/info,allusedtext
> > >Capture group count = 0
> > >Max lookbehind = 1
> >
> >This is correct, because the lookbehind does indeed just move back by
> >one character.
> >
>
> Not correct. We already discussed that \A is at its core equal to (?<!.)
> regardless of how it is really written in the PCRE code.
> It is a lookbehind assertion of length 1.


Unfortunately, we are talking about different things here. The length of
(?<=\A.) is 1 because it matches 1 character. When it is processed, the
current point is moved back by 1. There is then a check that the current
point is at the start, but no previous character is inspected.
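
The "move back by 1, then check for the start" behaviour can be tried out with a small sketch. It uses Python's `re` module as a stand-in, not PCRE2, on the assumption that the two engines treat this particular pattern alike; the helper name `match_positions` is made up for illustration, and Python has no counterpart to the `info` output quoted above.

```python
import re

# Python's re is used as a stand-in for PCRE2 here (an assumption).
LOOKBEHIND = re.compile(r"(?<=\A.)")


def match_positions(subject):
    """Return every position at which the lookbehind assertion succeeds."""
    return [m.start() for m in LOOKBEHIND.finditer(subject)]


# The assertion steps back exactly one character and then requires that
# point to be the start of the subject, so only position 1 qualifies.
positions = match_positions("abc")

# Starting the search at index 1 still succeeds: the single character
# before the start point is all the context the lookbehind needs.
from_offset = LOOKBEHIND.search("abc", 1)
```

Only one character before the current point is ever consulted, which is consistent with a reported maximum lookbehind of 1.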

> Consider an example. Let the subject "abc" arrive in two segments, "ab" and "c".
> First we try to match segment "ab":
>
> /(?<=\Ab)c/
> ab\=ph
> No match
>
> After this we keep maxlookbehind=1


Why? It hasn't given a partial match. After no match you should throw
away everything.

> last symbols ("b") and concatenate them
> with the second segment:
>
> /(?<=\Ab)c/
> bc\=offset=1
> 0: bc


I do not see this, either with 10.33 or the current code. I see this:

PCRE2 version 10.33 2019-04-16
re> /(?<=\Ab)c/
data> bc\=offset=1

0: c
data>


which is correct, of course, for an independent match.

> It's a wrong result, as it's obvious that the whole subject must not match.


Yes, I see what you are saying. I think this shows that my attempt to
suggest using max_lookbehind for partial matching does not work for a
number of cases. In other words: max_lookbehind is not what you want it
to be. That means it cannot be used for partial matching.

I think the answer is to recommend that the entire previous segment is
kept after a partial match.
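
The difference between keeping only max_lookbehind characters and keeping the whole previous segment can be sketched as follows. This is an illustration under stated assumptions, not PCRE2 code: Python's `re` module stands in for PCRE2, its `pos` argument plays the role of the `offset` modifier in pcre2test, and Python has no partial matching, so the segments are simply joined by hand. The helper `match_text` and the variable names are made up for the example.

```python
import re

# Python's re stands in for PCRE2 (an assumption); the pattern is the one
# from the example in this thread.
PATTERN = re.compile(r"(?<=\Ab)c")


def match_text(buffer, start):
    """Search buffer from index start; return the matched text or None."""
    m = PATTERN.search(buffer, start)
    return m.group(0) if m else None


first, second = "ab", "c"
max_lookbehind = 1  # value taken from the discussion above

# Reference: the whole subject "abc" in one piece does not match.
reference = match_text(first + second, 0)

# Keeping only the last max_lookbehind characters turns the retained "b"
# into the start of the buffer, so \A succeeds where it should not.
tail = first[-max_lookbehind:]
tail_result = match_text(tail + second, len(tail))

# Keeping the entire previous segment preserves the true start of the
# subject, and the result agrees with the reference.
full_result = match_text(first + second, len(first))
```

With the truncated buffer the search reports a match on "c", while both the one-piece subject and the fully retained segment report no match.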

It is also the case that \A (and probably \G as well) are confusing and
useless in partial matching.

Philip

--
Philip Hazel