Re: [pcre-dev] Max_lookbehind issues

Author: ph10
Date:  
To: ND
CC: Pcre-dev
Subject: Re: [pcre-dev] Max_lookbehind issues
On Sat, 27 Jul 2019, ND via Pcre-dev wrote:

> There are some kinds of problems that exist with max_lookbehind:


It was always a hack to try to make it possible to do multi-segment
matching using the normal matching function, something for which it was
not designed.

> - bugs
> - performance issues
> - brings excessive work to user
>
> Now I report only about potential bugs.


Unfortunately I believe we have reached the limit of what can be done to
the existing PCRE2 design to support multi-segment matching using
pcre2_match().

The code that is compiled for each lookbehind contains the fixed number
of characters by which the matching position is moved back before
running that lookbehind. When computing these values, it is easy to
remember the biggest one, and that is how max lookbehind was originally
computed.

I did manage recently (with a bit of effort) to take note of directly
nested lookbehinds so that, for example, /(?<=(?<=..).)/ ends up with a
maximum of 3 rather than 2. This has changed the meaning of max
lookbehind, and now that I have looked further, I am not entirely sure
that it is the right thing to do. (As it has not been released, it could
be removed.)
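
To make the difference between the two rules concrete, here is a toy C
model. It is only a sketch and not PCRE2's internal code: the struct, the
function names, and the assumption that the nested lookbehind sits at the
very start of the outer one (so the move-backs simply add up) are all
invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: one lookbehind, optionally holding another lookbehind
   at the very start of its body. Not PCRE2's internal representation. */
typedef struct lookbehind {
    int back;                         /* fixed move-back in characters */
    const struct lookbehind *nested;  /* directly nested lookbehind, or NULL */
} lookbehind;

/* Original rule: remember the biggest single move-back. */
int max_single(const lookbehind *lb)
{
    int best = 0;
    for (; lb != NULL; lb = lb->nested) {
        assert(lb->back >= 0);
        if (lb->back > best)
            best = lb->back;
    }
    return best;
}

/* Revised rule, assumed here to be a plain sum: a directly nested
   lookbehind starts from where the outer one has already moved back to. */
int max_nested(const lookbehind *lb)
{
    int total = 0;
    for (; lb != NULL; lb = lb->nested) {
        assert(lb->back >= 0);
        total += lb->back;
    }
    return total;
}

/* /(?<=(?<=..).)/ : the outer body has fixed length 1, the inner one 2. */
const lookbehind inner_example = { 2, NULL };
const lookbehind outer_example = { 1, &inner_example };
```

For the example pattern, max_single(&outer_example) gives 2 and
max_nested(&outer_example) gives 3, matching the two values mentioned
above.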

However, a nested *lookahead* is handled as an independent subpattern. It
does not interact with an outer lookbehind. That is why you see this:

> /(?<=(?=(?<=.)).)/info,allusedtext
> Capture group count = 0
> Max lookbehind = 1


Each individual lookbehind has a length of 1. I have looked at the code
and I can see no easy way of doing anything about this.
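
The gap for this pattern can be sketched with another toy C model. Again
this is an illustration under stated assumptions, not PCRE2 code: the
types and function names are hypothetical, each assertion is assumed to
sit at the start of the enclosing one, and the "actual reach" is obtained
by tracing the pattern by hand (the outer lookbehind moves back one
character, the lookahead runs at that position, and the inner lookbehind
moves back one more).

```c
#include <assert.h>
#include <stddef.h>

typedef enum { LOOKBEHIND, LOOKAHEAD } assert_kind;

/* Toy chain of nested assertions; not PCRE2's internal representation. */
typedef struct assertion {
    assert_kind kind;
    int back;                       /* move-back; 0 for a lookahead */
    const struct assertion *first;  /* assertion at start of body, or NULL */
} assertion;

/* Reported value in this model: every lookbehind is measured on its own,
   because the lookahead is treated as an independent subpattern. */
int reported_max(const assertion *a)
{
    int best = 0;
    for (; a != NULL; a = a->first) {
        assert(a->back >= 0);
        if (a->kind == LOOKBEHIND && a->back > best)
            best = a->back;
    }
    return best;
}

/* Distance actually travelled back from the starting position: the
   offset is carried through the lookahead instead of being reset. */
int actual_reach(const assertion *a)
{
    int offset = 0, deepest = 0;
    for (; a != NULL; a = a->first) {
        if (a->kind == LOOKBEHIND)
            offset += a->back;
        if (offset > deepest)
            deepest = offset;
    }
    return deepest;
}

/* /(?<=(?=(?<=.)).)/ : lookbehind(1) -> lookahead -> lookbehind(1) */
const assertion inner_lb = { LOOKBEHIND, 1, NULL };
const assertion middle_la = { LOOKAHEAD, 0, &inner_lb };
const assertion outer_lb = { LOOKBEHIND, 1, &middle_la };
```

In this model reported_max(&outer_lb) is 1, which agrees with the
pcre2test output quoted above, while actual_reach(&outer_lb) is 2.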

> /(?<=\A.)/info,allusedtext
> Capture group count = 0
> Max lookbehind = 1


This is correct, because the lookbehind does indeed just move back by
one character.

> /(?<=\G.)/info,allusedtext
> Capture group count = 0
> Max lookbehind = 1


And this is the same.

Philip

--
Philip Hazel