Re: [pcre-dev] match point reset bug?

Author: Philip Hazel
Date:  
To: Sheri
CC: pcre-dev
Subject: Re: [pcre-dev] match point reset bug?
On Sat, 12 Sep 2009, Sheri wrote:

> Interesting, thanks for the detailed explanation. Seems odd however that
> a lookbehind version works in 7.9?
>
> re> /(?<=abc)|(?<=def)/g
> data> abcdefghi
> 0:
> 0:
> data>


Lookbehind is different! It finds the empty string at the point at which
it is starting the match. That is, in your example above, the match
"bumpalong point" (to use Friedl's terminology) is 3. But if you use

/abc\K|def\K/g

the match happens when the bumpalong point is 0, even though \K moves
the reported match start to offset 3. The lookbehind version
does indeed work in 7.9. Having found the empty match at offset 3, it
fails to find a non-empty match, and so moves on by one character.
Matches then fail until it reaches offset 6, at which point it finds the
def match.

With \K, however, doing the same thing doesn't work. Having found the
empty match, it looks for a non-empty match starting at offset 3, and
fails. So it moves to offset 4, and so misses the next match. That's why
it has to ignore only the empty match at offset 3 itself, not one that
is found later in the string. Is that clear? I know it's tricky!
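
To make the difference concrete, here is a small toy model of the /g
loop. It is only a sketch and does not call PCRE at all: the names
toy_exec, toy_global and the TOY_* flags are invented for illustration,
the flag values are not the real option bits, and the two patterns are
hard-coded rather than compiled. It assumes the usual /g strategy of
retrying at the same offset, anchored, after an empty match, and bumping
along by one character if that retry fails.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the option bits; NOT the real PCRE values. */
enum { TOY_NOTEMPTY = 1, TOY_NOTEMPTY_ATSTART = 2, TOY_ANCHORED = 4 };

/* The two hard-coded patterns being modelled. */
enum toy_pattern {
    TOY_LOOKBEHIND,  /* models /(?<=abc)|(?<=def)/ */
    TOY_RESET_K      /* models /abc\K|def\K/       */
};

/* Try the pattern at one bumpalong position. Every match of these two
   patterns is empty, so a single offset describes it. Returns 1 on success. */
static int try_at(enum toy_pattern pat, const char *s, size_t len,
                  size_t pos, size_t *match)
{
    if (pat == TOY_LOOKBEHIND) {
        /* The lookbehind inspects the three characters before pos. */
        if (pos < 3) return 0;
        if (memcmp(s + pos - 3, "abc", 3) != 0 &&
            memcmp(s + pos - 3, "def", 3) != 0) return 0;
        *match = pos;            /* empty match AT the bumpalong point */
        return 1;
    }
    /* \K: consume three characters, then reset the reported start. */
    if (pos + 3 > len) return 0;
    if (memcmp(s + pos, "abc", 3) != 0 &&
        memcmp(s + pos, "def", 3) != 0) return 0;
    *match = pos + 3;            /* empty match three characters LATER */
    return 1;
}

/* Model of one exec call starting at offset start. */
static int toy_exec(enum toy_pattern pat, const char *s, size_t len,
                    size_t start, int options, size_t *match)
{
    for (size_t pos = start; pos <= len; pos++) {
        size_t m;
        if (try_at(pat, s, len, pos, &m)) {
            /* NOTEMPTY rejects every empty match; NOTEMPTY_ATSTART
               rejects only an empty match at the starting offset. */
            int rejected = (options & TOY_NOTEMPTY) ||
                ((options & TOY_NOTEMPTY_ATSTART) && m == start);
            if (!rejected) { *match = m; return 1; }
        }
        if (options & TOY_ANCHORED) break;   /* no bumpalong when anchored */
    }
    return 0;
}

/* Model of /g processing. retry_flag is the option used for the retry
   after an empty match. Stores match offsets in out, returns the count. */
static size_t toy_global(enum toy_pattern pat, const char *s,
                         int retry_flag, size_t *out, size_t max)
{
    size_t len = strlen(s), start = 0, n = 0;
    int options = 0;
    assert(out != NULL);
    while (start <= len && n < max) {
        size_t m;
        if (toy_exec(pat, s, len, start, options, &m)) {
            out[n++] = m;
            start = m;
            /* Every match here is empty, so always retry in place. */
            options = retry_flag | TOY_ANCHORED;
        } else if (options != 0) {
            start++;             /* retry failed: move on by one character */
            options = 0;
        } else {
            break;               /* ordinary search failed: finished */
        }
    }
    return n;
}
```

In this model, running the \K pattern over "abcdefghi" with TOY_NOTEMPTY
as the retry flag yields only the match at offset 3, whereas
TOY_NOTEMPTY_ATSTART yields offsets 3 and 6. The lookbehind pattern
yields 3 and 6 with either flag, because its empty match is always
reported at the bumpalong point itself.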

> I understand why you are making another option, but it sounds like as a
> result all user apps that do multiple matching (and the C++ module) will
> need to be modified to benefit. In fact if using a shared library, it
> will need to be processed one way if using version 8.0 or later and
> another if using 7.9 and earlier.


Aarrgghh. I had not thought of that. What I *had* thought of was that
people might be using PCRE_NOTEMPTY for completely different purposes,
and I did not want to break their applications.

> Have you considered giving the new option value to the old functionality?


Oh dear, oh dear, this looks like I have to make a judgement as to which
change will annoy the fewest people, remembering that the problem arises
only if \K is used in a pattern that matches an empty string. Something
like

/ab\Kc|de\Kf/

works fine in 7.9. I take your point about shared libraries etc. I am
now in a quandary as to what is the best way to proceed.

Anybody else on this list got any ideas? If I do as Sheri suggests, and
give PCRE_NOTEMPTY_ATSTART the bit value that was PCRE_NOTEMPTY, and
give PCRE_NOTEMPTY a new value, pre-compiled applications that use a
shared library would automatically move to the new functionality, but
any that actually wanted the PCRE_NOTEMPTY functionality would go wrong.

[At this point, I went away and played Mozart (string quartet) for an
hour. Helps clear the brain...]

Hmm... having thought about this for a bit, I think I am going to stick
with things the way they are. This is my reasoning:

* If I change the functionality of the existing options bit, programs
that use PCRE_NOTEMPTY for completely independent purposes (nothing to do
with /g-style processing) will suddenly break on a PCRE upgrade.

* On the other hand, programs which at present use PCRE_NOTEMPTY for
/g-style processing are not working properly in the presence of \K
patterns that match empty strings, a relatively rare situation (at
least, that's my guess).

It seems to me to be better to choose the action that allows people to
improve the behaviour of their programs that are not working quite right
over an action that breaks programs that are working well.

I take the point about PCRE versions. But if programmers are changing
their programs anyway, and want to remain compatible with previous
releases, it isn't too hard to do something like

#if PCRE_MAJOR >= 8
options |= PCRE_NOTEMPTY_ATSTART;
#else
options |= PCRE_NOTEMPTY;
#endif

which, in any case, is exactly the same as what you have to do for any
new option that is added.

Philip

--
Philip Hazel