http://bugs.exim.org/show_bug.cgi?id=1416
Summary: UTF-8 lookbehinds match bytes instead of characters
Product: PCRE
Version: 8.32
Platform: Other
OS/Version: Windows
Status: NEW
Severity: bug
Priority: medium
Component: Code
AssignedTo: ph10@???
ReportedBy: danielklein@???
CC: pcre-dev@???
I will use some PHP code with Sinhala UTF-8 characters (3 bytes each) to
illustrate the bug.
<?php
print(preg_replace('/(?<!ක)/u', '*', 'ක') . "\n"); // Works properly
print(preg_replace('/(?<!ක)/u', '*', 'ම') . "\n"); // Triggers the bug
?>
--------
Actual output:
*ක
*�*�*�*
--------
Expected output:
*ක
*ම*
The matcher appears to test the lookbehind at every byte position within the
UTF-8 encoded character, backtracking until it finds a valid starting byte and
then checking that. If you remove the improperly inserted asterisks, the
remaining bytes do encode the original character correctly.
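The byte-level picture can be reproduced without any regular expressions. The
following is a minimal sketch, assuming the observed output is simply the three
bytes of the character with an asterisk at every byte boundary. The variable
names are illustrative and not part of the original test case.

```php
<?php
// Illustrative sketch only: rebuilds the observed output by hand, no PCRE involved.
$char = 'ම';                      // one character, three bytes in UTF-8

// str_split() works on bytes, so this yields three one-byte strings.
$bytes = str_split($char);

// Put an asterisk at every byte boundary, as in the actual output above.
$observed = '*' . implode('*', $bytes) . '*';

// Stripping the asterisks gives back the original, correctly encoded character.
$restored = str_replace('*', '', $observed);

print(strlen($char) . " bytes: " . bin2hex($char) . "\n");
print(strlen($observed) . " bytes in the hand-built output\n");
print(($restored === $char ? "restored" : "not restored") . "\n");
```

If the hand-built string is byte-for-byte identical to what preg_replace()
returns, that would support the reading that a replacement is being made at
every byte offset rather than at every character offset.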
Once it has matched at the beginning of the string, it should then move ahead
by a character, not by a byte. It is possible that this is a PHP bug rather
than a PCRE bug, but I have no way of testing PCRE separately.
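One way to narrow this down from PHP is to list the byte offsets at which the
lookbehind matches and flag any that fall inside a character. This is only a
sketch: the helper names isCharBoundary() and matchOffsets() are made up for
illustration, and because it uses preg_match_all() rather than preg_replace()
it may not go through exactly the same code path as the test case above.

```php
<?php
// Hypothetical helper: true if $offset is at the start of a UTF-8 character
// or at the end of the string (the byte there is not a continuation byte).
function isCharBoundary($subject, $offset)
{
    if ($offset >= strlen($subject)) {
        return true;
    }
    // Continuation bytes have the bit pattern 10xxxxxx.
    return (ord($subject[$offset]) & 0xC0) !== 0x80;
}

// Hypothetical helper: byte offsets of every match of $pattern in $subject.
function matchOffsets($pattern, $subject)
{
    $offsets = array();
    if (!preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE)) {
        return $offsets;          // no match, or an error in the pattern/subject
    }
    foreach ($matches[0] as $match) {
        $offsets[] = $match[1];   // [0] is the matched text, [1] the byte offset
    }
    return $offsets;
}

// Report where the lookbehind from the test case matches in the failing subject.
foreach (matchOffsets('/(?<!ක)/u', 'ම') as $offset) {
    $where = isCharBoundary('ම', $offset) ? 'character boundary' : 'inside a character';
    print($offset . ': ' . $where . "\n");
}
```

Any offset reported as "inside a character" would be a position where a match
should never have been attempted in UTF-8 mode.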