Re: [pcre-dev] Max_lookbehind issues

Author: ND
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Max_lookbehind issues
On Sat, 10 Aug 2019 14:03:50 +0300, <ph10@???> wrote:

>> - bugs
>> - performance issues
>> - brings excessive work to user
>>
>> Now I report only about potential bugs.
>
> Unfortunately I believe we have reached the limit of what can be done to
> the existing PCRE2 design to support multi-segment matching using
> pcre2_match().
>


I don't think this means that every problem that can be resolved within
the current design has to stay unresolved anyway.
When I wrote the words quoted above in my previous post, I was talking
only about the current PCRE design.

The other suggestions, posted in the sibling topic "Partial match at
end of subject", are also within the current PCRE design. I hope you will
pay attention to them.

I would appreciate it if we first reached a consensus on these suggestions
before you start the rewrite of partial matching that you are planning.


> However, a nested *lookahead* is handled as an independent subpattern. It
> does not interact with an outer lookbehind. That is why you see this:
>
>> /(?<=(?=(?<=.)).)/info,allusedtext
>> Capture group count = 0
>> Max lookbehind = 1
>
> Each individual lookbehind has a length of 1. I have looked at the code
> and I can see no easy way of doing anything about this.
>


Since a lookahead is an independent pattern, we can calculate its own
maxlookbehind and add it to the outer lookbehind's own maxlookbehind. We
don't much care whether that "total" lookbehind is exactly the minimal
optimal value. What really matters is that it is not less than that value.
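
To make the proposal concrete, here is a minimal sketch in Python. It is
only a toy model and not PCRE2 code: the Lookbehind and Lookahead classes
and the conservative_max_lookbehind function are hypothetical names made up
for this illustration. Each lookbehind contributes its own length plus the
largest bound found among the assertions nested in its body, so the result
may overestimate but should never underestimate.

```python
from dataclasses import dataclass, field


@dataclass
class Lookahead:
    # Assertions nested inside the lookahead body (hypothetical toy model).
    nested: list = field(default_factory=list)


@dataclass
class Lookbehind:
    # How many characters this assertion itself moves back.
    length: int
    # Assertions nested inside the lookbehind body.
    nested: list = field(default_factory=list)


def conservative_max_lookbehind(node) -> int:
    """Return an upper bound on how far back `node` can inspect the subject."""
    inner = max((conservative_max_lookbehind(n) for n in node.nested), default=0)
    if isinstance(node, Lookbehind):
        # Add the nested bound to this lookbehind's own length.
        return node.length + inner
    # A lookahead moves nothing back itself; it only passes the bound through.
    return inner


# Toy model of /(?<=(?=(?<=.)).)/
pattern = Lookbehind(length=1, nested=[Lookahead(nested=[Lookbehind(length=1)])])
bound = conservative_max_lookbehind(pattern)
print(bound)
```

For the pattern above this model gives 2, where pcre2test reports 1.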


>> /(?<=\A.)/info,allusedtext
>> Capture group count = 0
>> Max lookbehind = 1
>
> This is correct, because the lookbehind does indeed just move back by
> one character.
>


That is not correct. We have already discussed that \A is in essence
equivalent to (?<!.), regardless of how it is actually implemented in the
PCRE code. It is a lookbehind assertion of length 1.

Consider an example. Let the subject "abc" arrive in two segments, "ab"
and "c". First we try to match the segment "ab":

/(?<=\Ab)c/
ab\=ph
No match

After this we keep the last maxlookbehind=1 characters ("b") and
concatenate them with the second segment:

/(?<=\Ab)c/
bc\=offset=1
0: bc

This is a wrong result, since obviously the whole subject must not match.

This bug arises because PCRE calculates max_lookbehind as 1. It should
calculate 2, because the pattern is equivalent to
/(?<=(?<!.)b)c/
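
The same effect can be reproduced outside PCRE. The sketch below uses
Python's re module purely as a stand-in (it is a different engine, and
search_second_segment is a hypothetical helper written for this
illustration). It relies on the fact that with a start position given to
search(), lookbehinds and \A still see the real beginning of the string,
which plays the role of offset=1 in the pcre2test run above. The partial
matching of the first segment is not modelled; only the second step is.

```python
import re


def search_second_segment(pattern, first, second, keep):
    """Search `second` while retaining the last `keep` characters of `first`."""
    retained = first[-keep:] if keep > 0 else ""
    window = retained + second
    # Start after the retained text; lookbehinds may still look into it,
    # and \A matches only at the real start of `window`.
    return pattern.search(window, len(retained))


pattern = re.compile(r"(?<=\Ab)c")

# Reference result: matching the whole subject at once.
whole = pattern.search("abc")

# Retaining 1 character, as max lookbehind = 1 suggests.
with_keep_1 = search_second_segment(pattern, "ab", "c", keep=1)

# Retaining 2 characters, as argued above.
with_keep_2 = search_second_segment(pattern, "ab", "c", keep=2)

print(whole, with_keep_1, with_keep_2)
```

With keep=1 the retained "b" sits at the start of the window, so \A
succeeds there and a match is reported. With keep=2 the result agrees
with matching the whole subject.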



>> /(?<=\G.)/info,allusedtext
>> Capture group count = 0
>> Max lookbehind = 1
>
> And this is the same.
>

I don't think it's the same. Maybe it needs some investigation...