Re: [pcre-dev] Quantifying backtracking verbs

Top Page
Delete this message
Author: ph10
Date:  
To: ND
CC: Pcre-dev
Subject: Re: [pcre-dev] Quantifying backtracking verbs
On Tue, 4 Jun 2019, ND via Pcre-dev wrote:

> PCRE2 version 10.33 2019-04-16
> /A(?:(*ACCEPT))?B/
> A
> No match
>
> /A(?:(*ACCEPT))?B/no_start_optimize
> A
> 0: A
>
> /A(*ACCEPT)?B/
> Failed: error 109 at offset 10: quantifier does not follow a repeatable item
> A
>
>
> I have a two questions with it:
> 1. Start optimizer brakes a result to "no match" from "match". Is there
> documented (I remember only example with (*COMMIT) where optimizer can make
> "match" from "no match")? May be there is a way to correct this PCRE
> optimization to not break a result.


It is documented that the start-up optimizations and the backtracking
control verbs such as (*ACCEPT) do not play well together. In
particular, the result of a match may change. Here is an extract from
the pcre2pattern documentation:

Note that (*COMMIT) at the start of a pattern is not the same as an
anchor, unless PCRE2's start-of-match optimizations are turned off, as
shown in this output from pcre2test:

      re> /(*COMMIT)abc/
    data> xyzabc
     0: abc
    data>
    re> /(*COMMIT)abc/no_start_optimize
    data> xyzabc
    No match


For the first pattern, PCRE2 knows that any match must start with "a",
so the optimization skips along the subject to "a" before applying the
pattern to the first set of data. The match attempt then succeeds. The
second pattern disables the optimization that skips along to the first
character. The pattern is now applied starting at "x", and so the
(*COMMIT) causes the match to fail without trying any other starting
points.

There is also a whole section called "Optimizations that affect
backtracking verbs" which contains this:

PCRE2 contains some optimizations that are used to speed up matching
by running some checks at the start of each match attempt. For
example, it may know the minimum length of matching subject, or that a
particular character must be present. When one of these optimizations
bypasses the running of a match, any included backtracking verbs will
not, of course, be processed.

> 2. If "(?:(*ACCEPT))?" syntax is allowed, then why a more ergonomic
> "(*ACCEPT)?" is disabled?


Repetition is allowed for groups such as (?:...) but not for individual
backtracking verbs (for which it is meaningless). PCRE2 is not clever
enough to realize that (?:(*ACCEPT))? has nothing other than (*ACCEPT)
inside (?:). If it did, I would expect an error. Repeating any group
that has an unconditional (*ACCEPT) does not seem meaningful but there
are cases such as (a(*ACCEPT))? that could perhaps be useful. Such a
group can only match zero or one times. It means "If 'a' end the match;
if not 'a' carry on with the pattern."

Philip

--
Philip Hazel