Re: [pcre-dev] Partial match at end of subject

Author: ph10
Date:  
To: ND
CC: Pcre-dev
Subject: Re: [pcre-dev] Partial match at end of subject
I have just committed a patch that makes some small changes to the way
partial matches are handled in the interpreter. I hope Zoltán will in
due course pick these up for the JIT. (There are new tests at the end of
testinput2 which have no_jit set at the moment.)

The changes are really quite small, but I think they address some of the
issues that have been discussed at length here. I am sorry that it has
taken so long for me to see the essential points, but I think I do now
have the general principles clear. The result is, in fact, two very
minor changes:

(1) The "must have inspected at least one character" condition for
recognizing a partial match is now extended with "OR the pattern
must contain a lookbehind of non-zero length". This applies to both hard
and soft partial matches. These two conditions ensure that a partial
match is recognized when there is a possibility that adding more
characters may enable a complete match to be found.

Interestingly, I discovered that I had documented this situation already
when (in pcre2partial) I wrote:

For this reason, a "no match" result should be interpreted as "partial
match of an empty string" when the pattern contains lookbehinds.

This sentence has now been removed from pcre2partial because an empty
partial match is now given.

(2) It was already documented that \z and \Z should not match at the end
of a subject if PCRE2_PARTIAL_HARD is set. This was not working when no
characters had been inspected (and, after (1) was implemented, still not
working for non-lookbehind patterns). I have made patterns such as /\z/
give appropriate partial matches.
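
As a rough illustration of the rule in (1), the old and new conditions
can be written as two predicates. This is only a sketch for discussion,
not code from the PCRE2 source: the struct, its field names and both
function names are invented here, and it assumes that a partial match is
only ever considered once the end of the subject has been reached. It
does not try to model the separate \z and \Z fix in (2).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one match attempt. These names are
   illustrative only and do not exist in PCRE2. */
typedef struct {
    bool hit_end_of_subject;     /* matching ran into the end of the subject */
    bool inspected_any_char;     /* at least one subject character was examined */
    bool has_nonzero_lookbehind; /* pattern contains a lookbehind of length > 0 */
} match_state;

/* Condition before the patch: a character must have been inspected. */
static bool partial_allowed_old(const match_state *s)
{
    assert(s != NULL);
    return s->hit_end_of_subject && s->inspected_any_char;
}

/* Condition after the patch: a non-zero-length lookbehind in the
   pattern is also enough, because more characters might complete
   a match even though none were inspected. */
static bool partial_allowed_new(const match_state *s)
{
    assert(s != NULL);
    return s->hit_end_of_subject &&
           (s->inspected_any_char || s->has_nonzero_lookbehind);
}
```

The two predicates differ only when no character has been inspected and
the pattern contains a lookbehind of non-zero length: the old rule
reports "no match" there, while the new rule allows an empty partial
match.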


Further points:

On Fri, 19 Jul 2019, ND via Pcre-dev wrote:

> Alternative suggestion may be:


<snip>

> Disadvantages:
> 1. It may be a breaking change.


Indeed, and that is one reason I have not done it. Also, the changes I
*have* made are very small, which I like. :-)


Finally:

There is still the problem of patterns for which a return value "no
match in this segment and it will never match however many more
characters are added" would be useful. ND quoted /(*COMMIT)(*F)/ as a
simple example. Is (*COMMIT) the only way this might happen?

There is an item on the Wish List requesting a way of determining
whether a match failed because of a start-up optimization or while
running a matching engine. I haven't done anything about it because it
would require JIT work.

What could be done is to add a new field to the match data that records
why a match failed. A new function (e.g. pcre2_get_fail_reason) could
return this to the user. Possible return values could be:

PCRE2_FAILEDBY_START_OPTIMIZATION
PCRE2_FAILEDBY_INTERPRETER
PCRE2_FAILEDBY_INTERPRETER_COMMIT

PCRE2_FAILEDBY_JIT_START_OPTIMIZATION
PCRE2_FAILEDBY_JIT
PCRE2_FAILEDBY_JIT_COMMIT

PCRE2_FAILEDBY_DFA_START_OPTIMIZATION
PCRE2_FAILEDBY_DFA_INTERPRETER

Could this be useful?
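
To make the idea concrete, here is a minimal sketch of how such an
interface might look from the caller's side. None of this exists in
PCRE2: the enum values are placeholders, mock_match_data stands in for
the real (opaque) match data block, and mock_get_fail_reason and
fail_is_final are made-up names. Treating the two COMMIT reasons as
"adding more characters cannot help" is an assumption based on the
/(*COMMIT)(*F)/ example, not something that has been decided.

```c
#include <assert.h>
#include <stddef.h>

/* Proposed reason codes; the names come from the list above, the
   numeric values are arbitrary placeholders. */
typedef enum {
    PCRE2_FAILEDBY_START_OPTIMIZATION,
    PCRE2_FAILEDBY_INTERPRETER,
    PCRE2_FAILEDBY_INTERPRETER_COMMIT,
    PCRE2_FAILEDBY_JIT_START_OPTIMIZATION,
    PCRE2_FAILEDBY_JIT,
    PCRE2_FAILEDBY_JIT_COMMIT,
    PCRE2_FAILEDBY_DFA_START_OPTIMIZATION,
    PCRE2_FAILEDBY_DFA_INTERPRETER
} fail_reason;

/* Stand-in for the new field in the match data (hypothetical). */
typedef struct {
    fail_reason reason;
} mock_match_data;

/* Hypothetical accessor in the spirit of pcre2_get_fail_reason. */
static fail_reason mock_get_fail_reason(const mock_match_data *md)
{
    assert(md != NULL);
    return md->reason;
}

/* Assumption: only a failure forced by (*COMMIT) is taken to mean
   that no amount of additional subject text can produce a match. */
static int fail_is_final(fail_reason r)
{
    return r == PCRE2_FAILEDBY_INTERPRETER_COMMIT ||
           r == PCRE2_FAILEDBY_JIT_COMMIT;
}
```

A caller feeding a subject in segments could then stop early when
fail_is_final() is true, and otherwise carry on with the next segment.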

Philip

--
Philip Hazel