On Tue, 24 Jan 2012, ND wrote:
> Maybe it would be useful to combine our approaches to reduce the number of
> cases in which a zero-length partial match may occur?
> So, a zero-length partial match is allowed when:
> 1. it arises within a lookahead
> AND
> 2. the pattern has a lookbehind with non-zero length
I knew I had to think about this more. My idea was wrong. You don't need
lookahead assertions. Consider /b(?<=..)/ and "abc". There are probably
many different examples one can construct. The key thing, of course, is
the lookbehind.
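To make that example concrete, here is a small sketch. It uses Python's re module purely as a stand-in for PCRE, which is an assumption made for illustration: re has no partial matching at all, and the pos argument of search() plays roughly the role of start_offset. It only shows that the match is lost when the earlier character is thrown away, and found when it is kept.

```python
import re

pattern = re.compile(r"b(?<=..)")

# Whole subject in one piece: "b" matches at offset 1, and the
# lookbehind can see the two characters "ab" that precede offset 2.
whole = pattern.search("abc")

# Subject split as "a" + "bc", with the first segment discarded:
# after matching "b" only one character is available to the lookbehind.
discarded = pattern.search("bc")

# Same split, but "a" is retained and the search starts at offset 1
# (the analogue of start_offset): the lookbehind can reach back to "a".
retained = pattern.search("abc", 1)

print(whole.span() if whole else None)
print(discarded)
print(retained.span() if retained else None)
```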
The documentation for partial matching says this:
    2. Lookbehind assertions at the start of a pattern are catered for in
    the offsets that are returned for a partial match. However, in theory,
    a lookbehind assertion later in the pattern could require even earlier
    characters to be inspected, and it might not have been reached when a
    partial match occurs. This is probably an extremely unlikely case; you
    could guard against it to a certain extent by always including extra
    characters at the start.
Obviously, this situation isn't as "extremely unlikely" as I thought...
I have had the following ideas:
An unanchored pattern could always give a zero-length partial match at
the end of the string. I chose not to do this, because at the time I did
not think it was ever useful. It turns out that it is useful, but only
if there is a lookbehind later in the pattern.
However, this means that "no match" actually means "zero-length partial
match at end of string". What does an application that is doing
multi-segment matching do with these results?
  No match:      get next segment and start matching at the start.
  Partial match: get next segment and start matching at the start.
In other words, *the same thing*. However, what matters is whether any
of the previous segment is retained, in case there is a lookbehind.
PCRE could tell the application the maximum lookbehind length, but what
it cannot tell is whether there is a lookbehind further along the path
that is being matched. So I don't think it should change its result.
However, an application can choose to treat "no match" as "partial
match", and retain some characters from the previous segment. So I think
the code could be something like this:
  IF hard partial matching AND not anchored[1] AND no match THEN
    Retain the last N characters of the current segment, where N is the
      maximum lookbehind length[2]
    Join the next segment to the retained characters
    Match again, with start_offset set to point to the start of the
      next segment
[1] PCRE_INFO_OPTIONS can be used to find if anchored.
[2] PCRE_INFO_MAX_LOOKBEHIND does not exist, but it could quite easily
be implemented. Until it is, you could just guess a suitable number.
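The pseudocode above could be sketched roughly as follows. Again this uses Python's re module as a stand-in for PCRE, so several things are assumptions: search_segments and max_lookbehind are hypothetical names, max_lookbehind is the caller's guess, the pattern is assumed to be unanchored, and because re cannot report partial matches the sketch covers only the "no match" branch. A match that genuinely spans two segments would need real partial matching and is not handled here.

```python
import re


def search_segments(pattern, segments, max_lookbehind):
    """Search a sequence of segments for the first match.

    Returns (start, end) as offsets into the whole stream, or None.
    After every "no match", the last max_lookbehind characters are
    retained so that a later lookbehind can still inspect them.
    """
    regex = re.compile(pattern)
    buffer = ""   # retained tail of earlier data plus the current segment
    base = 0      # offset in the whole stream of buffer[0]
    start = 0     # where matching starts in buffer (like start_offset)

    for segment in segments:
        buffer += segment                      # join next segment
        match = regex.search(buffer, start)    # match again from 'start'
        if match:
            return base + match.start(), base + match.end()

        # No match: treat it like a zero-length partial match at the end
        # and keep only the characters a lookbehind might need.
        keep = min(max_lookbehind, len(buffer))
        base += len(buffer) - keep
        buffer = buffer[len(buffer) - keep:]
        start = len(buffer)                    # next segment begins here

    return None


print(search_segments(r"b(?<=..)", ["a", "bc"], 2))
print(search_segments(r"b(?<=..)", ["a", "bc"], 0))
```

With two characters retained, the first call should find the "b" at stream offsets 1 to 2; with nothing retained, the second call should return None, which is the failure described above.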
Philip
--
Philip Hazel