[pcre-dev] [Bug 1504] DFA matching seems to have regressed, …

Góra strony
Delete this message
Autor: Philip Hazel
Data:  
Dla: pcre-dev
Temat: [pcre-dev] [Bug 1504] DFA matching seems to have regressed, causing GLib test failure
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1504




--- Comment #7 from Philip Hazel <ph10@???> 2014-07-21 18:44:21 ---
On Mon, 21 Jul 2014, Simon McVittie wrote:

> What message should I be passing on to the GLib maintainers? We've ruled out
> "the new PCRE is wrong", leaving the possibilities as either "the new PCRE is
> intentionally not fully compatible with the old" or "GLib's regression tests
> should not have been asserting that that precise behaviour is present".


Oh dear. You seem to have discovered a can of worms here, caused by
oversights on my (and others) part as the PCRE code has evolved. The
documentation is not as helpful as it might be. In the pcreapi page it
says:

A second matching function, pcre_dfa_exec(), which is not
Perl-compatible, is also provided. This uses a different algorithm for
the matching. The alternative algorithm finds all possible matches (at
a given point in the subject), and scans the subject just once (unless
there are lookbehind assertions). However, this algorithm does not
return captured substrings.

But in the pcrematching page, it modifies the statement about finding
all possible matches:

PCRE's "auto-possessification" optimization usually applies to
character repeats at the end of a pattern (as well as internally). For
example, the pattern "a\d+" is compiled as if it were "a\d++" because
there is no point even considering the possibility of backtracking
into the repeated digits. For DFA matching, this means that only one
possible match is found. If you really do want multiple matches in
such cases, either use an ungreedy repeat ("a\d+?") or set the
PCRE_NO_AUTO_POSSESS option when compiling.

I am at present in the middle of developing an entirely new API for PCRE
(called PCRE2, and discussed on the list some months ago). Once this is
done (most of the code is done and I'm working on converting the tests),
there will be a complete revision of the documentation, and I will try
to improve the DFA documentation to make it all clearer. I think the
bottom line is "please use PCRE2_NO_AUTO_POSSESS if you want to get all
possible matches from DFA matching" but it needs more explanation and
examples.

> If the new PCRE is intentionally not fully compatible with the old, perhaps we
> should be looking into a SONAME bump and a coordinated transition...


The changes were intentional (and I'm sure I bumped something, but
perhaps not enough) but we obviously didn't recognize the extent of the
incompatibility. As PCRE tries to track Perl, there may well be other
things like (?P<1>) in future ... in fact I have just discovered today
that Perl's treatment of \8 and \9 has changed in the absence of groups
numbered 8 or 9, and its treatment of \c when not followed by a
printable ASCII character is also different (it now gives an error).

> If GLib's regression tests are just being too picky, and should not have been
> making those assertions, then that's also useful information, and perhaps
> points to GLib's documentation being too specific about the expected results.
> What result should be expected from matching a+ against aaa in this mode?


I don't think the tests are too picky. This has flagged up something
that can be improved.

I think the DFA matching process should be much more clearly laid out in
the PCRE documentation; the extension of auto-possessification has
changed the situation. It should say something like this for DFA
matching /a+/ (i.e. a complete pattern, not as part of something else):

. Without PCRE_NO_AUTO_POSSESS the result is the single string "aaa".
. With PCRE_NO_AUTO_POSSESS the result is three matches, "aaa", "aa", and "a".

For normal (non-DFA) matching, the result is of course just "aaa".

> The GLib documentation currently says
>
> """
> Using the standard algorithm for regular expression matching only the longest
> match in the string is retrieved, it is not possible to obtain all the
> available matches. For instance matching "<a> <b> <c>" against the pattern
> "<.*>" you get "<a> <b> <c>".
>
> This function uses a different algorithm (called DFA, i.e. deterministic finite
> automaton), so it can retrieve all the possible matches, all starting at the
> same point in the string. For instance matching "<a> <b> <c>" against the
> pattern "<.*>;" you would obtain three matches: "<a> <b> <c>", "<a> <b>" and
> "<a>".
> """


That is correct. When there are explicit possessive quantifiers in the
pattern, the number of available matches may be smaller, but the creator
of the pattern should realize this. The problem situation is when PCRE
does some auto-possessification behind the user's back - this won't
always cause a problem, but it can, as we have seen, especially if more
cases are added to the auto-possessification code (which is what has
just happened).

If the intention of the GLib function is always to provide all possible
matches, then I would always recommend using PCRE_NO_AUTO_POSSESS.

Philip


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email