Re: [pcre-dev] Partial match at end of subject

Author: ND
Date:
To: Pcre-dev
Subject: Re: [pcre-dev] Partial match at end of subject

On 2019-07-15 15:24, ph10 wrote:

> My point about partial matching meaning "may be incomplete" is still
> true. Partial matching was not invented originally for multi-segment
> matching, but for dynamically checking input. For example, if a user is
> typing an 8-digit number, as each character is received it can be added
> to the input and then the input can be matched to ^\d{8}\z with a partial
> option. Later, other features were added to help multi-segment matching.
> Also, "partial hard" means "return a partial match if it is found before
> a complete match". There is no requirement for any particular ordering.
> Therefore, for a pattern such as /\z/ a complete match may be found and
> returned without any partial consideration.
>

Sorry, I was not thinking about \z and EOL. But \z and EOL are not related
to the issue that I want to discuss.

Let's differentiate partial_soft and partial_hard. I admit you are right
about partial_soft meaning "may be incomplete".
But now we are talking only about partial_hard.

This option was added ten years ago EXACTLY for multisegment matching.
Please read the very first proposal post and the thread about it. That's
how partial_hard was born:
https://lists.exim.org/lurker/message/20090524.142622.cb850f3a.en.html

At first I suggested naming this option PCRE_ERROR_MULTISEGMENT, which
states its real purpose directly. Your name, PCRE_PARTIAL_HARD, was
excellent, but it does not change the option's exclusive aim:
multisegment matching.

This option was not born alone but together with the last_bumpalong
return value, which was later transformed into max_lookbehind. And the
purpose of last_bumpalong was also exclusively multisegment matching.
Please read this thread about the first steps of PCRE_PARTIAL_HARD
development after it was born:
https://lists.exim.org/lurker/message/20090901.142330.58bea511.en.html

So the algorithm of multisegment matching is:
1. If the segment is not the last one, process it with the
PCRE_PARTIAL_HARD option and hold max_lookbehind characters in case
continuing with the next segment is needed.
2. If the segment is the last one, process it without PCRE_PARTIAL_HARD.

With multisegment matching we want the matching result to be exactly the
same as if we matched the whole subject at once!

Consider an example:

/c*+(?<=[bc])/
abc
0: c

Now imagine that the subject "abc" arrives in two segments: "ab" and "c".
We try to match the whole subject using two calls to pcre2_match(). The
first is:

/c*+(?<=[bc])/
ab\=ph
0:

Stop. We hoped to have only the result "c". But now we see there is yet
another successful match: an empty string at position 2. It's obvious that
this match is wrong, as we treat it as **subject "abc" successfully matches
"c*+(?<=[bc])" at position 2 with an empty string**.
And so the whole multisegment matching becomes wrong.

I'll be very happy if you reconsider your approach to
PCRE_PARTIAL_HARD and fully associate this option with multisegment
matching purposes, because that is what it was originally intended for.
But please tell me if you know of another practical use of this option
that forces you to change its original aims.

Thanks a lot.
