Re: [pcre-dev] Multi segment matching using pcre_dfa_exec

Top Page
Delete this message
Author: Philip Hazel
Date:  
To: Eldar Kleiner
CC: pcre-dev
Subject: Re: [pcre-dev] Multi segment matching using pcre_dfa_exec
On Sat, 17 Oct 2009, Eldar Kleiner wrote:

> Consider a similar example to the one from the documentation: (1234|3789)ABC
> If the input is given in two buffers: "ABC123" and "789ABC", pcre_dfa_exec()
> will return PCRE_PARTIAL for the first buffer and no match for the second
> even if PCRE_DFA_RESTART is used. This is somehow problematic, since (in run
> time) the caller gets no indication that a match was missed - "3789ABC".
>
> The suggestion was to return PCRE_PARTIAL_ALTERNATIVE in this case.


Ah, so I had understood you in the first place, and then I got confused.
Sorry about that.

I am afraid that what you want is not possible, because once it has got
to the point of returning something, pcre_dfa_exec() does not try
further matches that start further along the subject. Even it if did,
there is nowhere that it could remember the resulting data for use with
PCRE_DFA_RESTART.

What happens is this:

. It starts to look for a match starting at "A"; this fails immediately.
. The same, starting at "B" and then "C".
. Then it starts at "1" and finds the partial match "123" but runs out
of data, so it returns PCRE_PARTIAL, leaving details of the partial
match in the workspace so that it can restart *that particular* match.

In order to do what you want, it would have to try more matches,
starting at "2" and then "3", and in each case, there would be data to
be remembered. Then somehow it would have to return "there are several
partial matches, starting at different points in the subject". This is
not possible without a complete re-design of pcre_dfa_exec(), and that
isn't going to happen - at least, it is not something that I am at all
likely to do at this stage in the life of the code. (And indeed also at
this stage in *my* life - I'm retired now, and though I'm happy to
maintain PCRE at the moment, I very much doubt I'll ever do any large-
scale enhancements.)

> This will allow the caller (when the second buffer arrives) to
> construct a temp buffer "123789" (note that in this point the caller
> knows how much to copy from the end of the first buffer and how much
> to copy from the second buffer since the length of the alternative
> part is also known).


But the caller can do the same, right now. When the caller gets the
PCRE_PARTIAL return, a temporary buffer starting "123" can be
constructed, joined onto more data from the second buffer, and the
entire match can be retried, *not* using PCRE_DFA_RESTART.

("The length of the alternative part is known" may be true for your
caller, but in a general pattern that cannot be true.)

> pcre_dfa_exec() will return PCRE_PARTIAL when it is fed with the temp
> buffer. The caller can then continue to scan the second buffer from
> right place (index=3 after the part that was copied to the temp
> buffer) and the PCRE_DFA_RESTART option. In such a case a match will
> be returned.


That is not general enough. How do you know that the same problem has
not happened again in the temp buffer? (Quite apart from the problem of
how pcre_dfa_exec() is to know which partial match it is restarting, the
one beginning with "123" or the one beginning with "3"?)

> The idea is to give an indication to the caller of pcre_dfa_exec() that
> matches might have been missed and give her a chance to "fix" it. Without
> such an indication, the Multi Segment Matching is useless (at least when the
> re contains alternatives) for applications that must not lose matches.


The only way not to lose matches in this environment is to re-do the
entire match process. So the caller has to be prepared to copy the
partial match into a temp buffer, concatenated with data from the next
input buffer, and restart. In other words, not to use PCRE_DFA_RESTART.
(And this works with pcre_exec() as well, of course).

I hope that is fairly clear. I regret that there is no way of doing what
you want. At least this conversation has benefitted the code, because I
did find bugs in the new additions, which I have fixed. Thank you for
raising the issue.


Philip

--
Philip Hazel