Re: [pcre-dev] Multi segment matching using pcre_dfa_exec

Top Page
Delete this message
Author: Eldar Kleiner
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Multi segment matching using pcre_dfa_exec
I am afraid my example was not clear at all!

Consider a similar example to the one from the documentation: (1234|3789)ABC
If the input is given in two buffers: "ABC123" and "789ABC", pcre_dfa_exec()
will return PCRE_PARTIAL for the first buffer and no match for the second
even if PCRE_DFA_RESTART is used. This is somehow problematic, since (in run
time) the caller gets no indication that a match was missed - "3789ABC".

The suggestion was to return PCRE_PARTIAL_ALTERNATIVE in this case. This
will allow the caller (when the second buffer arrives) to construct a temp
buffer "123789" (note that in this point the caller knows how much to copy
from the end of the first buffer and how much to copy from the second buffer
since the length of the alternative part is also known). pcre_dfa_exec()
will return PCRE_PARTIAL when it is fed with the temp buffer. The caller can
then continue to scan the second buffer from right place (index=3 after the
part that was copied to the temp buffer) and the PCRE_DFA_RESTART option. In
such a case a match will be returned.

If for the same regular expression the input is given in the following
buffers: "ABC123789" and "ABC" - i.e. the first buffer ends after
the alternative part of the re, then pcre_dfa_exec() should return PCRE_PARTIAL
and the caller can continue to the second buffer using PCRE_DFA_RESTART and
find the match.

The idea is to give an indication to the caller of pcre_dfa_exec() that
matches might have been missed and give her a chance to "fix" it. Without
such an indication, the Multi Segment Matching is useless (at least when the
re contains alternatives) for applications that must not lose matches.

Thanks,
E.

On Sat, Oct 17, 2009 at 6:44 PM, Philip Hazel <ph10@???> wrote:

> On Fri, 16 Oct 2009, Eldar Kleiner wrote:
>
> > For example, suppose the subject is "xxxxmmmmaaammmxxxx" where xxxx
> > does not match the re, mmm matches the non alternative part and aaa
> > matches the alternative part of the re. If the subject is given in two
> > buffers: "xxxxmmm" and "maaammmxxxx" then PCRE_PARTIAL will be
> > returned for the first buffer and the caller can continue to the
> > second buffer using the PCRE_DFA_RESTART options. Alternatively, if
> > the subject is given in the following two buffers: "xxxxmmmma" and
> > "aammmxxxx", the caller will get the PCRE_PARTIAL_ALTERNATIVE return
> > code. In such a case the caller can construct a temp buffer "mmmmaaam"
> > (by copying the end of the first buffer and the start of the second)
> > as the longest part of the alternative part of the re is known. For
> > the temp buffer PCRE_PARTIAL will be returned and then the caller can
> > continue scanning the second buffer from the right offset (3 in our
> > example) and find the match. This will limit the amount of copying in
> > envs where the techniques you demonstrated is not possible.
>
> I am sorry, I'm not sure what you mean. Do you perhaps mean "optional"
> when you say "alternative"? Otherwise I don't understand why matching
> "xxxxmmmma" should give a partial return at all - shouldn't "mmmm" give
> a complete match? Perhaps you can give me a real example.
>
> If you do mean "optional", let me try to understand more. Suppose we
> consider the pattern "abcd*". If the first buffer is "xxxabcd" and the
> second buffer is "ddddxxxx" then, with PCRE 7.9 you will get a complete
> match on the first buffer, and miss out on the longest match. Is this
> the problem?
>
> If so, the problem is supposed to be solved in 8.00 by the addition of
> the PCRE_PARTIAL_HARD option, which prefers a partial match over a
> complete match. However, you have found a bug! In the -RC1 code, neither
> pcre_exec() nor pcre_dfa_exec() returns PCRE_PARTIAL for the pattern
> "abcd*" when applied to "xxxabcd". I have spotted the error in the code
> in pcre_exec() and will then try to find out what's wrong in
> pcre_dfa_exec().
>
> Philip
>
> --
> Philip Hazel
>