Author: Philip Hazel
Date:  
To: 1100
CC: pcre-dev
Subject: Re: [pcre-dev] [Bug 1100] Crash backtracking over unicode sequence
On Tue, 19 Jul 2011, Tom Hughes wrote:

> We were actually using \X as a convenient approximation for extended grapheme
> cluster ;-)


I've now read (slightly) more carefully the Unicode documentation about
grapheme clusters. The raw definition is "a base followed by zero or
more continuing characters", which is sort of what one expects. However,
the definition of what the base and what the continuing characters are
has been expanded to be a lot more complicated than an "extended Unicode
sequence", which is what \X in PCRE matches.

It seems that Perl has been changed to match the full Unicode
definition. I do not propose to change PCRE at this time, partly because
I want to get the next release out reasonably soon, and partly because I
think it needs careful thought as to what to do[*], and more research to
understand the Unicode definition properly.

When you created the test you posted, which is, in pcretest notation:

/^S(\X*)e(\X*)$/8
Stéréo

did you expect it to match or not to match? (In case people's displays
mangle the subject string above, it consists of 8 Unicode characters,
encoded in 10 UTF-8 bytes. After each "e" there is the character U+0301
(acute accent), represented as the two hex bytes CC, 81.)
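
For anyone whose display does mangle the subject string, the counts above
can be checked with a small Python sketch. This is only an illustration
and has nothing to do with pcretest itself; the variable names are made up
for the example.

```python
# The subject string, with each e-acute written as "e" followed by U+0301.
subject = "Ste\u0301re\u0301o"

# One entry per Unicode character (code point).
code_points = [f"U+{ord(ch):04X}" for ch in subject]

# The same string as UTF-8 bytes.
utf8 = subject.encode("utf-8")

print(len(subject), code_points)   # 8 characters
print(len(utf8), utf8.hex(" "))    # 10 bytes: 53 74 65 cc 81 72 65 cc 81 6f
```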

PCRE (now that I've fixed the bug) does not match. The first \X* matches
"t", the "e" is then matched, but the second \X* won't match a sequence
starting U+0301 because that is not a base character. Perl, however,
*does* match in this case. This could be argued to be because the
Unicode document says that "Degenerate cases include any isolated
non-base characters..." Hmm. Is that non-base character really
"isolated"?

This whole area is clearly a minefield. I imagine that if you converted
your e-acute representation to a single character, Perl would no longer
match it. On the other hand, PCRE would match /^Ste/ against that
string, so it isn't simple/consistent/straightforward either.
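
The two representations can be compared at the code-point level with a
short Python sketch. It uses Python's own re and unicodedata modules, not
PCRE or Perl, so it shows only what a literal pattern sees in each form of
the string; it says nothing about how either engine treats \X.

```python
import re
import unicodedata

# Decomposed form: "e" followed by U+0301, as in the posted test.
decomposed = "Ste\u0301re\u0301o"

# Composed form: NFC turns each e + U+0301 pair into the single
# character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))

# A literal "e" exists only in the decomposed form, so /^Ste/ can match
# there but has nothing to match in the composed form.
print(bool(re.match(r"^Ste", decomposed)))
print(bool(re.match(r"^Ste", composed)))
```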

Anyway, as I said, I propose to leave \X alone in PCRE, at least for the
moment. It is documented as matching (?>\PM\pM*) which is, at least to
me, easier to understand than the whole complication of extended
grapheme clusters. :-)
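
To make the documented behaviour concrete, here is a rough Python sketch
that emulates (?>\PM\pM*) by hand and then tries the pattern from the test.
It is not PCRE's code: the function names are invented for the example, and
"mark" is taken to mean any character whose Unicode general category starts
with M, as reported by Python's unicodedata module.

```python
import unicodedata

def is_mark(ch):
    # True for characters with the M (mark) property: Mn, Mc or Me.
    return unicodedata.category(ch).startswith("M")

def match_x(s, pos):
    # Emulate one atomic \PM\pM* at pos: a non-mark character followed by
    # as many marks as possible. Returns the end position, or None if the
    # character at pos is a mark (or pos is at the end of the string).
    if pos >= len(s) or is_mark(s[pos]):
        return None
    end = pos + 1
    while end < len(s) and is_mark(s[end]):
        end += 1
    return end

def x_star_ends(s, pos):
    # Every position where \X* starting at pos can stop, longest first,
    # which is the order a greedy repeat tries them when backtracking.
    ends = [pos]
    while (nxt := match_x(s, ends[-1])) is not None:
        ends.append(nxt)
    return ends[::-1]

def matches(s):
    # Emulate /^S(\X*)e(\X*)$/ on a string of code points.
    if not s.startswith("S"):
        return False
    for mid in x_star_ends(s, 1):
        if s.startswith("e", mid) and len(s) in x_star_ends(s, mid + 1):
            return True
    return False

print(matches("Ste\u0301re\u0301o"))  # decomposed subject from the test
print(matches("Stereo"))              # plain ASCII, for comparison
```

With the decomposed subject, every literal "e" is followed by U+0301, so
the second \X* always starts on a mark and the overall match fails. The
plain ASCII string matches.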

Philip

[*] At present, the code for handling \X in all the various situations
(single, repeated-minimized, repeated-maximized, etc.) is in-line. For
anything more complicated I suspect a subroutine is needed, which is
going to slow things down.

--
Philip Hazel