[pcre-dev] [Bug 1504] DFA matching seems to have regressed, …

Author: Simon McVittie
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1504] DFA matching seems to have regressed, causing GLib test failure
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1504




--- Comment #6 from Simon McVittie <smcv@???> 2014-07-21 16:45:54 ---
To be clear, the only concrete problem this has caused for me is "the GLib
tests didn't fail with the old PCRE, and they do fail with the new PCRE". I
have no idea which applications make which uses of DFA, or who added that
functionality or that test.

What message should I be passing on to the GLib maintainers? We've ruled out
"the new PCRE is wrong", leaving the possibilities as either "the new PCRE is
intentionally not fully compatible with the old" or "GLib's regression tests
should not have been asserting that that precise behaviour is present".

If the new PCRE is intentionally not fully compatible with the old, perhaps we
should be looking into a SONAME bump and a coordinated transition...

If GLib's regression tests are just being too picky, and should not have been
making those assertions, then that's also useful information, and perhaps
points to GLib's documentation being too specific about the expected results.
What result should be expected from matching a+ against aaa in this mode?

The GLib documentation currently says

"""
Using the standard algorithm for regular expression matching only the longest
match in the string is retrieved, it is not possible to obtain all the
available matches. For instance matching "<a> <b> <c>" against the pattern
"<.*>" you get "<a> <b> <c>".

This function uses a different algorithm (called DFA, i.e. deterministic finite
automaton), so it can retrieve all the possible matches, all starting at the
same point in the string. For instance matching "<a> <b> <c>" against the
pattern "<.*>" you would obtain three matches: "<a> <b> <c>", "<a> <b>" and
"<a>".
"""

Is that a fair description of how the DFA matching is meant to work / what it
is meant to do? Or is it more like "it returns a larger but not necessarily
exhaustive set of matches"?
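
To make the question concrete, here is a minimal sketch of what the GLib
wording quoted above would imply if read literally: every match that starts at
the leftmost matching position, longest first. It does not call PCRE or GLib at
all. It brute-forces the answer with Python's re module, and the function name
all_matches_from_first_start is made up for illustration. Whether the DFA
matcher is actually meant to return this exhaustive set is exactly what I am
asking.

```python
import re


def all_matches_from_first_start(pattern, subject):
    """Return every match beginning at the leftmost matching position.

    Results are ordered longest first. This is a brute-force illustration of
    the documented behaviour, not a model of how PCRE's DFA matcher works.
    """
    compiled = re.compile(pattern)
    for start in range(len(subject) + 1):
        # Try every possible end offset for this start, longest span first.
        found = [
            subject[start:end]
            for end in range(len(subject), start - 1, -1)
            if compiled.fullmatch(subject, start, end)
        ]
        if found:
            # Only the first position that matches at all is reported.
            return found
    return []


print(all_matches_from_first_start(r"<.*>", "<a> <b> <c>"))
print(all_matches_from_first_start(r"a+", "aaa"))
```

Under that literal reading, the first call gives the three matches listed in
the GLib documentation, and a+ against aaa gives "aaa", "aa" and "a".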


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email