Author: ph10 Date: To: ND Cc: Pcre-dev Subject: Re: [pcre-dev] Partial match at end of subject
On Wed, 17 Jul 2019, ND via Pcre-dev wrote:
Let us ignore for the moment whether there should be a new option or
not, and try to figure out what new logic might be needed. I am going to
experiment with the suggestion I made earlier:
If a hard partial match is possible, return PCRE2_PARTIAL if EITHER
at least one character has been inspected, OR the pattern contains a
lookbehind of length 1 or more. If neither of these conditions is
true, return "no match".
I expect this logic to solve a lot of the examples we have been
discussing, but I am not sure whether it is sufficient for all cases.
(It does not deal with \z, but that is a different issue altogether.)
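As a minimal sketch of the rule stated above, the decision can be written as a small predicate. Every name here (sketch_result, decide_partial and its parameters) is hypothetical and used only for illustration; none of it is part of the PCRE2 API, and the real matcher would have to work out these inputs itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; not PCRE2's own constants. */
typedef enum { SKETCH_NOMATCH, SKETCH_PARTIAL } sketch_result;

/*
 * hard_partial_possible: matching reached the end of the subject and
 *                        could only continue with more characters.
 * chars_inspected:       number of subject characters examined.
 * max_lookbehind:        longest lookbehind in the pattern, 0 if none.
 */
sketch_result
decide_partial(bool hard_partial_possible, size_t chars_inspected,
               size_t max_lookbehind)
{
    if (!hard_partial_possible)
        return SKETCH_NOMATCH;

    /* EITHER at least one character inspected, OR a lookbehind of
       length 1 or more. */
    if (chars_inspected > 0 || max_lookbehind >= 1)
        return SKETCH_PARTIAL;

    /* Neither condition holds. */
    return SKETCH_NOMATCH;
}
```

Under this sketch, a pattern that contains a one-character lookbehind yields a partial match even when no character has been inspected, while a pattern with no lookbehind and no inspected characters yields "no match".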
> The doc properly says: "A partial match occurs during a call to pcre2_match() when
> the end of the subject string is reached successfully, but matching cannot
> continue because more characters are needed. However, at least one character
> in the subject must have been inspected. The requirement for inspecting at
> least one character exists because an empty string can always be matched;
> without such a restriction there would always be a partial match of an empty
> string at the end of the subject."
>
> But the docs don't quite describe the core of the situation where more
> characters are STILL NEEDED but no characters have been inspected.
Yes. That is the main point. I *think* the condition "pattern has a
backtrack" is the important one. Consider /c*+/ with a segment "ab". It
needs to inspect a character at the end of the string, but because there
is no possibility of backtracking, the result should be "this segment
cannot be part of the match", so "no match" is the right result.
> With
> /c*+(?<=[bc])/aftertext
> ab\=ph
> we see exactly the same situation: matching cannot continue because more
> characters are needed, and no characters have been inspected. So the result
> must be "no match" too.
With my new idea this will give "partial match", I believe.
> In my opinion, the wrong result of the test above (/c*+(?<=[bc])/) is a bug (or
> an oversight from when the option was introduced) and should be fixed. There is
> no need to add new options (PCRE2_NOTEOS) and complicate PCRE.
Yes, I have come to agree with that. I will post again when I have done
experiments.
> And there is another unsolved question about multisegment matching: what
> should PCRE return if, while matching a segment, it discovers that the whole
> subject cannot match?
>
> Here is example:
>
> /cd/
> ab\=ph
>
> /(*COMMIT)(*F)/
> ab\=ph
>
> Both results are "no match", but for the purposes of multisegment matching they
> have different meanings. The first should mean "the current segment does not
> belong to a successful match, but later segments might".
> The second should mean "neither the current segment nor any later segment can
> belong to a successful match".
>
> These two kinds of "no match" should be distinguishable to the caller. But how?