[pcre-dev] [Bug 1416] UTF-8 lookbehinds match bytes instead of characters

Author: Graycode
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 1416] UTF-8 lookbehinds match bytes instead of characters

------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1416

--- Comment #2 from Graycode <xgcode@???> 2013-11-19 22:04:30 ---
I barely know enough about this topic to be dangerous, so please reply
about any inaccuracy as I try to decipher this.

The C code which implements most of PHP's use of PCRE can be found at:
http://faculty.washington.edu/brandl/phparchive/php-5.2.3/ext/pcre/php_pcre.c
There we see, for example, that PHP's "/u" flag means PCRE_UTF8.
(Related to bug# 1412, it also shows the setlocale() call and the use
of pcre_maketables().)

Now as for this issue posted by "CJ Dennis":

PHP interprets "\x" escapes only in double-quoted strings, so the
pattern and subject are double-quoted here. Showing the UTF-8
characters in hex notation, the equivalent PHP statements would be:

<?php
// Works properly
print(preg_replace("/(?<!\xE0\xB6\x9A)/u", '*', "\xE0\xB6\x9A") . "\n");
// Triggers the bug
print(preg_replace("/(?<!\xE0\xB6\x9A)/u", '*', "\xE0\xB6\xB8") . "\n");
?>

\xE0\xB6\x9A is the UTF-8 encoding of U+0D9A, "SINHALA LETTER ALPAPRAANA KAYANNA"
http://www.fileformat.info/info/unicode/char/0d9a/index.htm

\xE0\xB6\xB8 is the UTF-8 encoding of U+0DB8, "SINHALA LETTER MAYANNA"
http://www.fileformat.info/info/unicode/char/0db8/index.htm
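The byte sequences above can be cross-checked outside of PHP and PCRE.
The following is only a minimal sketch in Python (my choice for the
check, not something used in the bug report); the variable name
"decoded" is made up for illustration.

```python
import unicodedata

# Decode each 3-byte sequence quoted above and look up its code point and name.
decoded = {}
for hex_text in ("E0 B6 9A", "E0 B6 B8"):
    ch = bytes.fromhex(hex_text).decode("utf-8")
    decoded[hex_text] = (f"U+{ord(ch):04X}", unicodedata.name(ch))
    print(hex_text, *decoded[hex_text])
```

Each sequence decodes to exactly one character, which is what a
character-based lookbehind would be expected to step over.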

The 2 lines of "Actual output" shown as their hex bytes are:
2A E0 B6 9A
2A EF BF BD 2A EF BF BD 2A EF BF BD 2A

Note that in the 2nd line each "EF BF BD" is the UTF-8 encoding of
U+FFFD, "REPLACEMENT CHARACTER". Many UTF converters emit this value
to indicate that the input was invalid or otherwise not supported.
Unfortunately I have no way to know whether a UTF conversion issue
happened in the exim.org website when this issue was reported, or
whether it happens within PHP at CJ's site. I hope CJ will please
clarify this.
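One hypothesis that would account for those exact bytes can be sketched
as follows. This is an assumption on my part, not a confirmed
diagnosis: suppose an asterisk were inserted at every byte boundary of
E0 B6 B8, and some later layer then substituted U+FFFD for each byte
that no longer forms valid UTF-8. The names "bytewise", "repaired" and
"reported" are made up for this sketch, and Python's decoder merely
stands in for whatever converter was actually involved.

```python
original = bytes.fromhex("E0 B6 B8")  # U+0DB8 as UTF-8

# Hypothetical byte-level result: "*" at every byte boundary.
bytewise = b"*" + b"*".join(bytes([b]) for b in original) + b"*"

# Stand-in for a converter that replaces invalid UTF-8 with U+FFFD.
repaired = bytewise.decode("utf-8", errors="replace").encode("utf-8")

# The second "Actual output" line quoted from the bug report.
reported = bytes.fromhex("2A EF BF BD 2A EF BF BD 2A EF BF BD 2A")

print(repaired == reported)
```

If the comparison holds, the reported output is at least consistent
with matching at byte positions, but it does not tell us where the
U+FFFD substitution happened.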

The 2 lines of "Expected output" shown as their hex bytes are:
2A E0 B6 9A
2A E0 B6 B8 2A

I don't know why there seems to be an extra 2A (asterisk) at the end
of the 2nd one.
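For comparison, here is what a different regex engine does with the
same pattern at character level. This is only a sketch using Python's
"re" module (Python 3.7 or later), which is not PCRE and need not agree
with it; the helper hex_bytes() is made up for illustration.

```python
import re

KAYANNA = "\u0d9a"  # E0 B6 9A
MAYANNA = "\u0db8"  # E0 B6 B8
pattern = re.compile("(?<!" + KAYANNA + ")")

def hex_bytes(text):
    """Return the UTF-8 bytes of text as space-separated hex."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))

first = pattern.sub("*", KAYANNA)
second = pattern.sub("*", MAYANNA)

# Character offsets at which the empty match succeeds in the 2nd subject.
positions = [m.start() for m in pattern.finditer(MAYANNA)]

print(hex_bytes(first))
print(hex_bytes(second))
print(positions)
```

In this engine the empty match also succeeds at the end of the second
subject, because the character before that position is MAYANNA and not
KAYANNA. That may be where the trailing asterisk in the expected
output comes from.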

In summary, CJ is showing that PHP produced an unexpected result for
the 2nd match of the example. Here I'm not going to try to verify,
prove, or disprove the claim because there are several unknowns.

I will say though that I think an expression consisting of only a
Negative Look-Behind Assertion may always match empty at position 0 of
a subject, and PCRE may (or should?) always do that in the same manner
regardless of the subject.
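The position-0 claim can be illustrated, again only with Python's "re"
module as a stand-in and not with PCRE itself. The subject list and
the name "at_zero" are made up for this sketch.

```python
import re

pattern = re.compile("(?<!\u0d9a)")

# Subjects: KAYANNA, MAYANNA, arbitrary text, and the empty string.
subjects = ["\u0d9a", "\u0db8", "anything", ""]

# match() anchors the attempt at offset 0, where nothing precedes the
# position, so the negative lookbehind has nothing to reject.
at_zero = [pattern.match(s) is not None for s in subjects]
print(at_zero)
```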

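The following is a PCRETEST equivalent file for match conditions. The
pattern uses \x{0d9a} because, in UTF-8 mode, PCRE reads \xE0 within a
pattern as the code point U+00E0 rather than as a single byte, while
pcretest data lines still take each \xhh as one byte:

/-- An attempted interpretation of the reported bug# 1416 --/

/(?<!\x{0d9a})/8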
\xE0\xB6\x9A
\xE0\xB6\xB8
anything, and next subject is (empty) nothing
\

/-- End of test --/

Regards,
Graycode

--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email
