[pcre-dev] [Bug 1416] New: UTF-8 lookbehinds match bytes ins…

Author: CJ Dennis
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1416] New: UTF-8 lookbehinds match bytes instead of characters
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1416
           Summary: UTF-8 lookbehinds match bytes instead of characters
           Product: PCRE
           Version: 8.32
          Platform: Other
        OS/Version: Windows
            Status: NEW
          Severity: bug
          Priority: medium
         Component: Code
        AssignedTo: ph10@???
        ReportedBy: danielklein@???
                CC: pcre-dev@???



I will use some PHP code with Sinhala UTF-8 characters (3 bytes each) to
illustrate the bug.

<?php
print(preg_replace('/(?<!ක)/u', '*', 'ක') . "\n"); // Works properly
print(preg_replace('/(?<!ක)/u', '*', 'ම') . "\n"); // Triggers the bug
?>

--------
Actual output:
*ක
*�*�*�*

--------
Expected output:
*ක
*ම*

The lookbehind appears to be tried at every byte position within the UTF-8
encoded character: at each position it backs up until a valid starting byte is
found, then tests the character that starts there. If you remove the improperly
inserted asterisks, the remaining bytes do encode the original character
correctly.
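
The suspected behaviour can be modelled in a few lines of Python working on the
raw UTF-8 bytes. This is only a sketch of the hypothesis described above, not
PCRE's actual code; the function names are invented for illustration.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return byte & 0xC0 == 0x80


def simulate_byte_stepping(subject: str, forbidden: str) -> bytes:
    """Model of the suspected bug for the pattern (?<!forbidden).

    The assertion is tried at EVERY byte offset. At each offset the model
    backs up to the nearest preceding UTF-8 start byte and checks whether
    `forbidden` begins there. If it does not, b'*' is inserted.
    """
    data = subject.encode("utf-8")
    needle = forbidden.encode("utf-8")
    out = bytearray()
    for pos in range(len(data) + 1):
        # Back up one "character": skip continuation bytes to a start byte.
        start = pos - 1
        while start > 0 and is_continuation(data[start]):
            start -= 1
        preceded = start >= 0 and data.startswith(needle, start)
        if not preceded:  # the negative lookbehind succeeds here
            out += b"*"
        out += data[pos:pos + 1]
    return bytes(out)


same = simulate_byte_stepping("ක", "ක")
other = simulate_byte_stepping("ම", "ක")

print(same.decode("utf-8"))                     # *ක
print(other.decode("utf-8", errors="replace"))  # *�*�*�*
# Stripping the inserted asterisks recovers the original character.
print(other.replace(b"*", b"").decode("utf-8"))  # ම
```

Under this model the first call reproduces the "works properly" case and the
second reproduces the broken output, which is consistent with the behaviour
reported here but does not prove where the fault lies.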

Once it has matched at the beginning of the string, it should then move ahead
by a character, not by a byte. It is possible that this is a PHP bug and not a
PCRE bug, but I have no way of testing PCRE separately.
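
For comparison, here is a minimal Python sketch of the expected behaviour, in
which match positions advance one character at a time. It assumes that a
"character" is a single code point, which holds for the two Sinhala letters
used in this report; the function name is hypothetical.

```python
def simulate_character_stepping(subject: str, forbidden: str) -> str:
    """Model of the expected result for the pattern (?<!forbidden).

    The assertion is tried only at character boundaries (Python str
    offsets count code points), and '*' is inserted wherever the text
    before the position does not end with `forbidden`.
    """
    out = []
    for pos in range(len(subject) + 1):
        if not subject[:pos].endswith(forbidden):
            out.append("*")
        out.append(subject[pos:pos + 1])
    return "".join(out)


print(simulate_character_stepping("ක", "ක"))  # *ක
print(simulate_character_stepping("ම", "ක"))  # *ම*
```

Both results match the expected output listed above.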

