Thanks for the detailed response.
Unfortunately, in our environment the caller does not control the buffers
and thus cannot solve the problem in the way you presented. This is also the
reason we chose pcre_dfa_exec() even after all enhancements in pcre_exec
that are introduced in 8.00 which I am currently testing.
Let me re-put what I was trying to suggest:
Suppose a new return value is introduce: PCRE_PARTIAL_ALTERNATIVE. This
value will be returned in cases of partial matches where the subject ends in
the alternative part of the re. In any other case of a partial match,
the regular PCRE_PARTIAL will be returned. This will allow the caller to
identify cases where matches across boundaries might be missed.
For example, suppose the subject is "xxxxmmmmaaammmxxxx" where xxxx does not
match the re, mmm matches the non alternative part and aaa matches the
alternative part of the re. If the subject is given in two buffers:
"xxxxmmm" and "maaammmxxxx" then PCRE_PARTIAL will be returned for the first
buffer and the caller can continue to the second buffer using the
PCRE_DFA_RESTART
options. Alternatively, if the subject is given in the following two
buffers: "xxxxmmmma" and "aammmxxxx", the caller will get the
PCRE_PARTIAL_ALTERNATIVE return code. In such a case the caller can
construct a temp buffer "mmmmaaam" (by copying the end of the first buffer
and the start of the second) as the longest part of the alternative part of
the re is known. For the temp buffer PCRE_PARTIAL will be returned and then
the caller can continue scanning the second buffer from the right offset (3
in our example) and find the match. This will limit the amount of copying in
envs where the techniques you demonstrated is not possible.
Another enhancement might be to return PCRE_PARTIAL_ALTERNATIVE only when
the alternative part is problematic (i.e. non-anchored and so on..). This
can probably be computed in compile time.
Thanks again,
E.
On Fri, Oct 16, 2009 at 10:30 AM, Philip Hazel <ph10@???>wrote:
> On Thu, 15 Oct 2009, Eldar Kleiner wrote:
>
> > We were wondering what do you think about adding the following pcre
> behviour
> > when multi segment matching is used with pcre_dfa_exec():
> >
> > At the moment, patterns that contain alternatives at the top level are
> > somehow problematic as only alternatives that match at lowest point in
> the
> > subject are remembered. As a result matches might be missed for
> alternatives
> > which share the same characters.
> >
> > We thought it might be useful to be able to define the number ?saved?
> points
> > or at least return an indication that matches might be missed. (and then
> let
> > the caller of pcre_dfa_exec) decides if linearization should be used to
> find
> > possible lost matches.
>
> The Bad News:
>
> The problem here is that pcre_dfa_exec() looks no further along the
> subject once it has found a partial match. Even if it did, there is
> nowhere for it to remember anything. I became even more aware of this
> recently when doing some work on partial matching for pcre_exec() (*not*
> pcre_dfa_exec()).
>
> The Slightly Better News:
>
> I am hoping to release 8.00 early next week. This contains improvements
> to partial matching with pcre_exec() (removing all restrictions and
> giving feedback about the partially matched string). It should now be
> possible to do a kind of multi-segment matching using pcre_exec()
> provided that the matched string is short enough to fit in your buffer.
>
> For example, roughly speaking, if your buffer contains "xxxxxxxmmm"
> where "mmm" is the partial match at the end of the buffer, you can throw
> away "xxxxxxxx", move "mmm" to the start of the buffer, refill the
> buffer, and then try the whole match again with "mmmyyyyyy". If it
> doesn't match at the start, it will move on as normal, so you would
> find, for example "myy" if that was a match.
>
> This technique can, of course, also be used with pcre_dfa_exec(). I said
> "roughly speaking" above because there is an issue with lookbehinds. The
> code for both functions now passes back the offset of the earliest
> character inspected during the partial match (which might be earlier
> than the actual match point), but it is theoretically possible for a
> lookbehind later in the pattern to look back even further. Some
> protection against this happening is to leave extra characters at the
> start of the buffer. It doesn't seem to me to be very likely that people
> would actually use patterns that match a few characters and then
> lookbehind beyond what has already been matched, but I have learned
> never to underestimate what actually gets done. :-)
>
> In the documentation for the 8.00 release I have written this:
>
> 4. Patterns that contain alternatives at the top level which do not
> all start with the same pattern item may not work as expected when
> pcre_dfa_exec() is used. For example, consider this pattern:
>
> 1234|3789
>
> If the first part of the subject is "ABC123", a partial match of the
> first alternative is found at offset 3. There is no partial match for
> the second alternative, because such a match does not start at the
> same point in the subject string. Attempting to continue with the
> string "7890" does not yield a match because only those alternatives
> that match at one point in the subject are remembered. The problem
> arises because the start of the second alternative matches within the
> first alternative. There is no problem with anchored patterns or
> patterns such as:
>
> 1234|ABCD
>
> where no string can be a partial match for both alternatives. This is
> not a problem if pcre_exec() is used, because the entire match has to
> be rerun each time.
>
> I will add a sentence pointing out that this approach can also be used
> with pcre_dfa_exec().
>
> If you want to try the 8.00 release candidate, you will find it here:
>
>
> ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Testing/pcre-8.00-RC1.tar.gz
>
> ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Testing/pcre-8.00-RC1.tar.bz2
>
> ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Testing/pcre-8.00-RC1.zip
>
> Philip
>
> --
> Philip Hazel
>