Re: [pcre-dev] Multi segment matching using pcre_dfa_exec

Top Page
Delete this message
Author: Philip Hazel
Date:  
To: Eldar Kleiner
CC: Pcre-dev
Subject: Re: [pcre-dev] Multi segment matching using pcre_dfa_exec
On Thu, 15 Oct 2009, Eldar Kleiner wrote:

> We were wondering what do you think about adding the following pcre behviour
> when multi segment matching is used with pcre_dfa_exec():
>
> At the moment, patterns that contain alternatives at the top level are
> somehow problematic as only alternatives that match at lowest point in the
> subject are remembered. As a result matches might be missed for alternatives
> which share the same characters.
>
> We thought it might be useful to be able to define the number ?saved? points
> or at least return an indication that matches might be missed. (and then let
> the caller of pcre_dfa_exec) decides if linearization should be used to find
> possible lost matches.


The Bad News:

The problem here is that pcre_dfa_exec() looks no further along the
subject once it has found a partial match. Even if it did, there is
nowhere for it to remember anything. I became even more aware of this
recently when doing some work on partial matching for pcre_exec() (*not*
pcre_dfa_exec()).

The Slightly Better News:

I am hoping to release 8.00 early next week. This contains improvements
to partial matching with pcre_exec() (removing all restrictions and
giving feedback about the partially matched string). It should now be
possible to do a kind of multi-segment matching using pcre_exec()
provided that the matched string is short enough to fit in your buffer.

For example, roughly speaking, if your buffer contains "xxxxxxxmmm"
where "mmm" is the partial match at the end of the buffer, you can throw
away "xxxxxxxx", move "mmm" to the start of the buffer, refill the
buffer, and then try the whole match again with "mmmyyyyyy". If it
doesn't match at the start, it will move on as normal, so you would
find, for example "myy" if that was a match.

This technique can, of course, also be used with pcre_dfa_exec(). I said
"roughly speaking" above because there is an issue with lookbehinds. The
code for both functions now passes back the offset of the earliest
character inspected during the partial match (which might be earlier
than the actual match point), but it is theoretically possible for a
lookbehind later in the pattern to look back even further. Some
protection against this happening is to leave extra characters at the
start of the buffer. It doesn't seem to me to be very likely that people
would actually use patterns that match a few characters and then
lookbehind beyond what has already been matched, but I have learned
never to underestimate what actually gets done. :-)

In the documentation for the 8.00 release I have written this:

4. Patterns that contain alternatives at the top level which do not
all start with the same pattern item may not work as expected when
pcre_dfa_exec() is used. For example, consider this pattern:

    1234|3789


If the first part of the subject is "ABC123", a partial match of the
first alternative is found at offset 3. There is no partial match for
the second alternative, because such a match does not start at the
same point in the subject string. Attempting to continue with the
string "7890" does not yield a match because only those alternatives
that match at one point in the subject are remembered. The problem
arises because the start of the second alternative matches within the
first alternative. There is no problem with anchored patterns or
patterns such as:

    1234|ABCD


where no string can be a partial match for both alternatives. This is
not a problem if pcre_exec() is used, because the entire match has to
be rerun each time.

I will add a sentence pointing out that this approach can also be used
with pcre_dfa_exec().

If you want to try the 8.00 release candidate, you will find it here:

ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Testing/pcre-8.00-RC1.tar.gz
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Testing/pcre-8.00-RC1.tar.bz2
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Testing/pcre-8.00-RC1.zip

Philip

--
Philip Hazel