Re: [pcre-dev] Partial match at end of subject

Author: ND
Date:  
To: Pcre-dev
Subject: Re: [pcre-dev] Partial match at end of subject
On 2019-07-13 16:47, ph10 wrote:
> On Sat, 13 Jul 2019, ND via Pcre-dev wrote:
>> PCRE2_PARTIAL_HARD is intended for multisegment matching. I think when
>> this option is set it means: this subject IS incomplete, it's only a
>> non-last part of a certain "entire" subject.
> It was never intended to mean "this subject is incomplete", rather "this
> subject MAY BE incomplete". However ...
>


I don't agree. It can't be "MAY BE". Multisegment matching is based on the
fact that before calling pcre2_match() we know exactly whether the subject
is complete or not. Without this knowledge there is no way to treat an
arriving segment of the subject appropriately. When we know that the
subject IS incomplete, we call pcre2_match() with the PCRE2_PARTIAL_HARD
option. When the last segment arrives, we call pcre2_match() without
PCRE2_PARTIAL_HARD.
At its core, \z is a negative lookahead assertion that wants to inspect the
next character of the subject.
It is something like (?!\C).
Oops! While writing this I noticed that (?!\C) and other patterns also
return a complete match:

PCRE2 version 10.33 2019-04-16
/(?!\C)/
ab\=ph
0:

PCRE2 version 10.33 2019-04-16
/c*/
ab\=ph
0:

I think it's obvious that these examples (the last one most clearly)
should return "no match". Why does PCRE decide that the segment "ab" is the
last (complete) one? It is the user's job to say so, by setting or not
setting the PCRE2_PARTIAL_HARD option.


> . Are we at the end of the subject?   If no, backtrack
> . Is partial matching allowed?        If no, continue matching
> . Have we inspected any characters?   If no, continue matching
> . Is it partial hard?                 If yes, return a partial match


I see that this algorithm covers both PARTIAL_SOFT and PARTIAL_HARD. Sorry,
I haven't thought much about PARTIAL_SOFT and have no practice with it. I
propose the following algorithm (for PARTIAL_HARD only, disregarding the
existence of PARTIAL_SOFT):

. Are we at the end of the subject?   If no, backtrack
. Is partial hard matching allowed?   If no, continue matching
. Have we inspected any characters?   If yes, return a partial match
                                      Else return "no match"
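
To make the difference concrete, here is a minimal sketch in C of the two
decision procedures, restricted to PARTIAL_HARD. It is not PCRE2 source
code: the enum, the function names and the three boolean inputs are
assumptions made only for illustration, and "partial matching allowed" in
the quoted steps is treated here as "PARTIAL_HARD is set".

```c
#include <stdbool.h>

/* Illustrative outcomes of the end-of-subject check (hypothetical names,
   not PCRE2 identifiers). */
typedef enum {
    ACT_BACKTRACK,  /* the item fails here, try another path      */
    ACT_CONTINUE,   /* carry on matching the rest of the pattern  */
    ACT_PARTIAL,    /* stop and report a partial match            */
    ACT_NOMATCH     /* stop and report "no match"                 */
} action;

/* The quoted steps, with PARTIAL_SOFT left out of the picture. */
action quoted_check(bool at_end, bool partial_hard, bool inspected_any)
{
    if (!at_end)        return ACT_BACKTRACK;
    if (!partial_hard)  return ACT_CONTINUE;
    if (!inspected_any) return ACT_CONTINUE;
    return ACT_PARTIAL;
}

/* The steps proposed in this message (PARTIAL_HARD only). */
action proposed_check(bool at_end, bool partial_hard, bool inspected_any)
{
    if (!at_end)       return ACT_BACKTRACK;
    if (!partial_hard) return ACT_CONTINUE;
    return inspected_any ? ACT_PARTIAL : ACT_NOMATCH;
}
```

The two functions differ in exactly one case: at the end of the subject,
with PARTIAL_HARD set and no characters inspected, the quoted steps
continue matching while the proposed steps return "no match".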


And the same algorithm should be applied in other cases, such as the
examples above.


> The only way this could be changed would be to make some kind of
> exception for \z and I do not think that is a good idea.


I think no exception is needed. It should be handled in the common way. We
see that this is not only a \z issue.