Re: [pcre-dev] Partial match at end of subject

Author: ph10
Date:
To: ND
CC: Pcre-dev
Subject: Re: [pcre-dev] Partial match at end of subject

On Mon, 15 Jul 2019, ND via Pcre-dev wrote:

> This option was added ten years ago EXACTLY for multisegment matching.
> Please read the very first proposal post and the thread about it. That's how
> partial_hard was born:
> https://lists.exim.org/lurker/message/20090524.142622.cb850f3a.en.html

Your memory is much better than mine. :-) However, this does mean that
partial hard matching has been the way it is for 10 years. For this
reason alone, I am very reluctant to make a change, because somebody,
somewhere is probably relying on the current behaviour.

> With multisegment matching we want the matching result to be exactly the
> same as if we matched the whole subject at once!

That is a good point, but I'm not sure it can always be achieved, and I
am not happy with your example:

/c*+(?<=[bc])/aftertext
abc
0: c
0+
ab\=ph
0:
0+

Your pattern encodes the requirement "find zero or more c's preceded by
b or c". When the input is "ab" that is exactly what it has done, which
is why it shows a complete match. It would make no sense never to give a
full match when \=ph is used. This is surely correct:

/abc/
xyzabcdef\=ph
0: abc
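
The difference over the lookbehind pattern can be reproduced outside
PCRE2. What follows is only a sketch using Python's re module, which is
an assumption made for illustration: it has no partial matching at all,
and the possessive c*+ is written as plain c* so that it also runs on
older Python versions (for these two subjects that makes no difference).
It compares what a matcher reports for the whole subject with what it
reports when it sees only the first segment.

```python
import re

# Stand-in for /c*+(?<=[bc])/ from the thread; plain c* replaces the
# possessive c*+ (an assumption for portability, not the original pattern).
PATTERN = re.compile(r"c*(?<=[bc])")

def first_match(subject):
    """Return (start offset, matched text) of the first match, or None."""
    m = PATTERN.search(subject)
    return None if m is None else (m.start(), m.group(0))

whole = first_match("abc")    # the complete subject
segment = first_match("ab")   # only the first segment of it

print("whole  :", whole)
print("segment:", segment)
```

The first segment on its own gives a complete empty match at offset 2,
while the whole subject gives "c" at that offset.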

I'm trying to understand what the conditions are for a change in
behaviour. The issue is what should happen when a possible "partial
match" situation occurs, but no characters have been inspected. This
happens in this example:

/c*+(?<=[bc])/aftertext
ab\=ph
0:
0+

If the subject is "abc" (again with \=ph) you get a partial match, because
one character has been inspected. So what should happen when a possible
partial match has not inspected any characters? These are the choices:

1. Backtrack, maybe find a complete match (happens now).
2. Give an immediate "no match" - wrong for the test above.
3. Return a partial match of no inspected characters - also wrong.

The last one is wrong in general because it means that all unanchored
patterns would give partial matches. For example:

/abc/
xyz\=ph

I don't think it is right for that to yield a partial match. (I'm
assuming no start-up optimizations.)
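
To see why choice 3 goes too far, here is a toy model in Python. It
handles literal patterns only and is in no way how PCRE2 is implemented;
the function name and the allow_empty_partial flag are invented for this
sketch. With the flag off it behaves like choice 1 (carry on and look for
something better); with the flag on it behaves like choice 3.

```python
def hard_partial_literal(pattern, subject, allow_empty_partial=False):
    """Toy hard-partial search for a literal pattern (hypothetical helper).

    Returns ("match", start, end), ("partial", start, end) or
    ("nomatch", None, None).
    """
    n = len(subject)
    for start in range(n + 1):
        i = 0
        # Consume subject characters for as long as they agree with the pattern.
        while (i < len(pattern) and start + i < n
               and subject[start + i] == pattern[i]):
            i += 1
        if i == len(pattern):
            return ("match", start, start + i)
        if start + i == n:
            # The subject ran out before the pattern was complete.
            # i counts the characters consumed from this start position.
            if i > 0 or allow_empty_partial:
                return ("partial", start, n)
    return ("nomatch", None, None)

print(hard_partial_literal("abc", "xyzabcdef"))
print(hard_partial_literal("abc", "xyzab"))
print(hard_partial_literal("abc", "xyz"))
print(hard_partial_literal("abc", "xyz", allow_empty_partial=True))
```

With the flag on, "xyz" gives a partial match of zero characters at the
very end of the subject, and the same would be true of any subject that
does not contain a complete match.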

> I'll be very happy if you reconsider your approach to PCRE_PARTIAL_HARD
> and associate this option entirely with multisegment matching, because
> that is what it was originally intended for.
> But please tell me if you know of another practical use of this option that
> forces you to change its original aims.

The problem is that I don't know what people use it for. It's been
around for 10 years, and all my previous experience is that people use
software in all kinds of strange ways not foreseen by its creator.

I think the only compromise here is perhaps to add a new option, to give
changed behaviour, but I do not know what the change should be.

Aha! I think I have spotted how to distinguish between /c*+(?<=[bc])/
and /abc/. The first one will have a non-zero max lookbehind. So,
something like this is needed:

Invent a new option called PCRE2_NOTEOS. This would do two things:

(1) $, \z, and \Z would never match. This would deal with the /\z/
example.

(2) If a hard partial match is possible, but no characters have been
inspected, AND max lookbehind is non-zero, return PCRE2_ERROR_PARTIAL.
Otherwise backtrack as now.

I am not sure that I like that; it seems messy and hard to explain, but
at the moment it's the best I can come up with. More thought is needed,
especially to see if (2) is all that is required, and if it would
adversely affect other patterns.
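
As a rough illustration of (2) only, the decision can be written as a
small Python function. Nothing here exists in PCRE2: PCRE2_NOTEOS is
just the proposal above, and the function name, its arguments, and the
string results are invented for the sketch. It assumes the matcher has
already found that a hard partial match is possible.

```python
def noteos_action(chars_inspected, max_lookbehind):
    """Sketch of rule (2) for the proposed PCRE2_NOTEOS (hypothetical).

    chars_inspected: subject characters inspected by the current attempt.
    max_lookbehind:  the pattern's maximum lookbehind length.
    Returns "partial" or "backtrack".
    """
    if chars_inspected > 0:
        # Existing behaviour: something was inspected, so report partial.
        return "partial"
    if max_lookbehind > 0:
        # New case: nothing inspected, but a lookbehind could look at text
        # in a later segment differently, so report partial.
        return "partial"
    # Nothing inspected and no lookbehind: backtrack as now.
    return "backtrack"

# /c*+(?<=[bc])/ at the end of "ab": nothing inspected, lookbehind of 1.
print(noteos_action(chars_inspected=0, max_lookbehind=1))
# /abc/ at the end of "xyz": nothing inspected, no lookbehind.
print(noteos_action(chars_inspected=0, max_lookbehind=0))
```

This separates the two patterns in the way described, but it says
nothing about whether other patterns would be adversely affected.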

Philip

--
Philip Hazel
