Re: [pcre-dev] several messages

Author: Philip Hazel
Date:
To: ND
CC: Pcre-dev
Subject: Re: [pcre-dev] several messages

On Tue, 24 Jan 2012, ND wrote:

> Maybe it would be useful to combine our approaches to reduce the number of
> cases in which a zero-length partial match may occur?
> So, a zero-length partial match is allowed when:
> 1. it arises within a lookahead
> AND
> 2. the pattern has a lookbehind with non-zero length

I knew I had to think about this more. My idea was wrong. You don't need
lookahead assertions. Consider /b(?<=..)/ and "abc". There are probably
many different examples one can construct. The key thing, of course, is
the lookbehind.
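
To make the /b(?<=..)/ case concrete, here is a minimal sketch in C. It does
not call PCRE at all: find_b_lookbehind2() is a hypothetical, hand-coded
stand-in for that one pattern, written only to show why the characters
before the starting offset matter.

```c
#include <stddef.h>

/* Hypothetical stand-in for the pattern /b(?<=..)/ (not a PCRE function).
   Scanning starts at start_offset, but the lookbehind is allowed to
   inspect bytes before start_offset as long as they are in the buffer.
   Returns the offset of the matched 'b', or -1 if there is no match. */
long find_b_lookbehind2(const char *subject, size_t length,
                        size_t start_offset)
{
    for (size_t i = start_offset; i < length; i++) {
        /* After 'b' is consumed the current position is i + 1, and
           (?<=..) needs two characters before that position. */
        if (subject[i] == 'b' && i + 1 >= 2)
            return (long)i;
    }
    return -1;
}
```

Given "abc" in one piece, this finds the 'b' at offset 1. If the subject
arrives as the segments "a" and "bc" and each is matched on its own, neither
matches, because the lookbehind cannot see the "a" from the second segment.
Keeping the "a" and matching "abc" with a starting offset of 1 finds the
match again.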

The documentation for partial matching says this:

2. Lookbehind assertions at the start of a pattern are catered for in
the offsets that are returned for a partial match. However, in theory,
a lookbehind assertion later in the pattern could require even earlier
characters to be inspected, and it might not have been reached when a
partial match occurs. This is probably an extremely unlikely case; you
could guard against it to a certain extent by always including extra
characters at the start.

Obviously, this situation isn't as "extremely unlikely" as I thought...

I have had the following ideas:

An unanchored pattern could always give a zero-length partial match at
the end of the string. I chose not to do this, because at the time I did
not think it was ever useful. It turns out that it is useful, but only
if there is a lookbehind later in the pattern.

However, this means that "no match" actually means "zero-length partial
match at end of string". What does an application that is doing
multi-segment matching do with these results?

No match:       get next segment and start matching at the start.
Partial match:  get next segment and start matching at the start.

In other words, *the same thing*. However, what matters is whether any
of the previous segment is retained, in case there is a lookbehind.

PCRE could tell the application the maximum lookbehind length, but what
it cannot tell is whether there is a lookbehind further along the path
that is being matched. So I don't think it should change its result.
However, an application can choose to treat "no match" as "partial
match", and retain some characters from the previous segment. So I think
the code could be something like this:

  IF hard partial matching AND not anchored[1] AND no match THEN
    Retain maximum lookbehind length[2] in current segment
    Join next segment
    Match again, with start_offset set to point to next segment

[1] PCRE_INFO_OPTIONS can be used to find out whether the pattern is anchored.
[2] PCRE_INFO_MAX_LOOKBEHIND does not exist, but it could quite easily
    be implemented. Until it is, you could just guess a suitable number.
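
The "retain and join" step above might look something like the following in
C. This is only a sketch of the buffer handling, under the assumption that
the application holds each segment in memory; retain_and_join() and its
parameters are hypothetical names, not part of PCRE, and max_lookbehind is
whatever number the application has guessed or obtained.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE): keep the last max_lookbehind
   bytes of the current segment, append the next segment, and report the
   offset at which matching should resume. Returns a malloc'd,
   NUL-terminated buffer that the caller must free, or NULL if the
   allocation fails. */
char *retain_and_join(const char *current, size_t current_len,
                      const char *next, size_t next_len,
                      size_t max_lookbehind,
                      size_t *joined_len, size_t *start_offset)
{
    /* Never keep more than the current segment actually contains. */
    size_t keep = current_len < max_lookbehind ? current_len : max_lookbehind;

    char *joined = malloc(keep + next_len + 1);
    if (joined == NULL)
        return NULL;

    memcpy(joined, current + (current_len - keep), keep);
    memcpy(joined + keep, next, next_len);
    joined[keep + next_len] = '\0';

    *joined_len = keep + next_len;
    *start_offset = keep;   /* first byte of the next segment */
    return joined;
}
```

The application would then pass the joined buffer and its length to
pcre_exec(), with the returned start_offset as the starting offset, so that
a lookbehind can inspect the retained characters while matching itself
begins in the new segment.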

Philip

--
Philip Hazel
