Re: [pcre-dev] Partial match at end of subject

Author: ND
Date: 2019-07-17 20:05 -000
To: Pcre-dev
Subject: Re: [pcre-dev] Partial match at end of subject
On 2019-07-17 16:55, ph10 wrote:
> On Mon, 15 Jul 2019, ND via Pcre-dev wrote:
> > This option is added ten years ago EXACTLY for multisegment matching.
> > Please read a very first proposal post and thread about it. Thats how
> > partial_hard is born:
> > https://lists.exim.org/lurker/message/20090524.142622.cb850f3a.en.html
>
> Your memory is much better than mine. :-) However, this does mean that
> partial hard matching has been the way it is for 10 years. For this
> reason alone, I am very reluctant to make a change, because somebody,
> somewhere is probably relying on the current behaviour.
>


The same can be said about every old bug that is discovered: somebody,
somewhere is probably relying on the current behaviour. Does that mean we
must add a new option for every bug instead of fixing it?
I am convinced that the PCRE behaviour under discussion is a bug, because it
completely contradicts the original aims of PARTIAL_HARD. Nobody wants this
behaviour and nobody asked for it.
I think the ability to do multisegment matching is a great advantage of PCRE
over Perl and other regex engines. It allows matching streamed data, which
is everywhere around us. Behaviour where the sum of multisegment matches
gives a different result from matching the whole subject is wrong and an
oversight.



> > With multisegment matching we want that matching result be exactly the
> > same as if we match a whole subject at once!
>
> That is a good point, but I'm not sure it can always be achieved.


It's not just a good point. It's the basic point, the foundation of
multisegment matching.
Why do you doubt that it can be achieved?
If there are such examples, then I'm afraid multisegment matching with PCRE
is not viable.
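
To make the requirement concrete, here is a minimal sketch in Python. It
does not use PCRE at all: match_literal is a hypothetical toy stand-in for
pcre2_match() that handles only a non-empty literal pattern, and
match_segments is a hypothetical driver that keeps the partially matched
tail and joins it with the next segment. The only point is the property
argued for above: feeding the segments one by one should give the same
answer as matching the joined subject.

```python
from enum import Enum


class Result(Enum):
    MATCH = "match"
    PARTIAL = "partial match"
    NO_MATCH = "no match"


def match_literal(pattern, subject, partial_hard=False):
    """Toy engine (assumption: pattern is a non-empty literal string).

    Returns (Result, start, end). With partial_hard=True, a candidate that
    reaches the end of the subject while still needing characters is
    reported as PARTIAL. Only non-empty tails are tried, so at least one
    character has always been inspected.
    """
    for start in range(len(subject)):
        tail = subject[start:]
        if tail.startswith(pattern):
            return Result.MATCH, start, start + len(pattern)
        if partial_hard and pattern.startswith(tail):
            # tail is a non-empty proper prefix of the pattern
            return Result.PARTIAL, start, len(subject)
    return Result.NO_MATCH, -1, -1


def match_segments(pattern, segments):
    """Match a stream given as segments; offsets refer to the whole stream."""
    buffer = ""   # retained text that may still start a match
    offset = 0    # position of buffer[0] in the whole stream
    for i, segment in enumerate(segments):
        buffer += segment
        is_last = i == len(segments) - 1
        # The last segment is matched without the partial option.
        result, start, end = match_literal(pattern, buffer,
                                           partial_hard=not is_last)
        if result is Result.MATCH:
            return Result.MATCH, offset + start, offset + end
        if result is Result.PARTIAL:
            # Keep only the partially matched tail for the next round.
            offset += start
            buffer = buffer[start:]
        else:
            # Nothing in this buffer can belong to a later match.
            offset += len(buffer)
            buffer = ""
    return Result.NO_MATCH, -1, -1


whole = match_literal("abc", "xxabcyy")
pieces = match_segments("abc", ["xxa", "b", "cyy"])
print(whole == pieces)
```

In this toy model both calls report a match at offsets 2 to 5, and a
segment such as "xyz" is simply dropped because none of its characters can
belong to a later match.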

> /c*+(?<=[bc])/aftertext
> ab\=ph
> 0:
> 0+
>
> Your pattern encodes the requirement "find zero or more c's preceded by
> b or c". When the input is "ab" that is exactly what it has done, which
> is why it shows a complete match.


No. "c*+" is greedy, so the engine must try to inspect the next character
after "b". More characters are needed to complete the match.

Here is an example:

/b++/
ab\=ph
Partial match: b

If we apply your words ("When the input is "ab" that is exactly what it has
done, which is why it shows a complete match") then the result here must be
a complete match and not a partial match. But it's obvious that a complete
match would be the wrong result and that the current PCRE result is right.
This example does not differ from the previous one: more characters are
needed at the end of the subject. The only difference is that here the
character "b" has been inspected, while in the previous example no
characters have been inspected.

The documentation correctly says:
"A partial match occurs during a call to pcre2_match()
when the end of the subject string is reached successfully, but matching
cannot continue because more characters are needed. However, at least one
character in the subject must have been inspected. The requirement for
inspecting at least one character exists because an empty string can
always be matched; without such a restriction there would always be a
partial match of an empty string at the end of the subject."

But the docs don't really describe the core of the situation where more
characters are STILL NEEDED but no characters have been inspected.
In the context of multisegment matching it means that no characters from the
current segment can belong to the match, but a match may still be achieved
in the next segment(s). For the current segment this is equivalent to "no
match". And PCRE outputs this result:

/abc/
xyz\=ph
No match
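
The restriction in the quoted documentation paragraph can be illustrated
with a small Python sketch. It is a toy model for a non-empty literal
pattern only, not PCRE code, and partial_candidates and its
require_inspected parameter are hypothetical names. It lists every position
where the rest of the subject could still be the beginning of a match.

```python
def partial_candidates(pattern, subject, require_inspected=True):
    """Return (start, text) for each tail of subject that is a proper
    prefix of the literal pattern, i.e. each possible partial match.

    With require_inspected=True the empty tail at the very end of the
    subject is not considered, because no character would be inspected.
    """
    stop = len(subject) if require_inspected else len(subject) + 1
    candidates = []
    for start in range(stop):
        tail = subject[start:]
        if len(tail) < len(pattern) and pattern.startswith(tail):
            candidates.append((start, tail))
    return candidates


# /abc/ against "xyz": no character of the segment can belong to a match.
print(partial_candidates("abc", "xyz"))
# Without the restriction an empty partial match appears at the end.
print(partial_candidates("abc", "xyz", require_inspected=False))
# /abc/ against "xab": "ab" has been inspected and more is needed.
print(partial_candidates("abc", "xab"))
```

In this model "xyz" yields no candidates at all, which corresponds to the
"No match" above, while dropping the restriction would add an empty
candidate at the end of every subject.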

With
/c*+(?<=[bc])/aftertext
ab\=ph
we see exactly the same situation: matching cannot continue because more
characters are needed, and no characters have been inspected. So the result
must be "no match" too.



> So what should happen when a possible partial match
> has not inspected any characters? These are the choices:
>
>
> 1. Backtrack, maybe find a complete match (happens now).
> 2. Give an immediate "no match" - wrong for the test above.
> 3. Return a partial match of no inspected characters - also wrong.


You are right: 3 is wrong.
1 is also wrong, because more characters are STILL NEEDED.

The right choice is to give the same result as /abc/ gives for xyz\=ph (for
the reasons discussed above).
In my opinion the wrong result of the test above (/c*+(?<=[bc])/) is a bug
(or an oversight from when the option was introduced) and should be fixed.
There is no need to add new options (PCRE2_NOTEOS) and complicate PCRE.



And there is another unsolved question about multisegment matching: what
should PCRE return if, while matching a segment, it discovers that the
whole subject cannot match?

Here is an example:

/cd/
ab\=ph

/(*COMMIT)(*F)/
ab\=ph

Both results are "no match". But for the purposes of multisegment matching
they have different meanings. The first should mean "the current segment
does not belong to a successful match, but it's possible that the next
segments could".
The second should mean "neither the current segment nor any following
segment belongs to a successful match".

These two kinds of "no match" should be distinguishable by the caller. But how?