[pcre-dev] [Bug 1416] UTF-8 lookbehinds match bytes instead …

Author: CJ Dennis
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1416] UTF-8 lookbehinds match bytes instead of characters
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1416

CJ Dennis <danielklein@???> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID





--- Comment #3 from CJ Dennis <danielklein@???> 2013-11-19 23:46:12 ---
@Graycode
The equivalent English regex would be:
<?php
// /u is unnecessary but harmless here, as there are no multi-byte characters
print(preg_replace('/(?<!k)/u', '*', 'k') . "\n");
print(preg_replace('/(?<!k)/u', '*', 'm') . "\n");
?>

--------
Actual output:
*k
*m*

@Philip Hazel
I could see the characters correctly in your reply. The regex looks for every
position that is not preceded by the Sinhala letter for 'k' and inserts a '*'
there. The string to search in is the third argument (Sinhala 'k' and Sinhala
'm'). Yes, /u is there to turn on UTF-8 mode; otherwise PHP treats the pattern
as ASCII (albeit with 8 bits).

You seem to be correct that it's not a PCRE bug. I downloaded pcre-7.0.exe (I'm
using Windows), read the pcre-man.pdf file and constructed a test file with the
regex and the subject strings in it. I got one match for the first string and
two matches for the second. If the bug were present I would expect to see four
matches for the second string (one at each byte boundary of its three-byte
UTF-8 encoding), not two. I'm therefore assuming that PCRE is behaving
correctly (and that no bug has crept in between 7.0 and 8.32).
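
The counting argument above can be checked independently of PCRE and PHP. The
sketch below uses Python's re module purely as a stand-in engine; it is not
PCRE and says nothing about how PHP calls it. It assumes the Sinhala 'k' is
U+0D9A (the letter in the pattern) and the Sinhala 'm' is U+0DB8; the helper
name count_positions is made up for illustration. Matching on characters gives
one and two positions, while matching on the raw UTF-8 bytes gives four
positions for the second string.

```python
import re

KA = "\u0d9a"  # Sinhala 'k', the letter used in the pattern
MA = "\u0db8"  # Sinhala 'm' (assumed code point for the second subject)


def count_positions(pattern, subject):
    """Count the zero-width positions where the pattern matches."""
    return sum(1 for _ in re.finditer(pattern, subject))


# Character mode: the lookbehind sees one character, so a one-character
# subject has only two candidate positions (before and after it).
char_pattern = "(?<!" + KA + ")"

# Byte mode: the same letter is three bytes in UTF-8 (E0 B6 9A), so a
# one-character subject has four candidate positions.
byte_pattern = b"(?<!" + re.escape(KA.encode("utf-8")) + b")"

print(count_positions(char_pattern, KA))                  # 1
print(count_positions(char_pattern, MA))                  # 2
print(count_positions(byte_pattern, MA.encode("utf-8")))  # 4
```

One and two positions correspond to character-wise matching; four positions
for the second string would be the sign that something is stepping through the
subject byte by byte.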

I will report this bug on the PHP site. Thanks for your time!

By the way the input file I used was:
--------
/(?<!(ක))/g8


--------

and the output was:
--------
PCRE version 7.0 18-Dec-2006

/(?<!(ක))/g8

0:

0:
0:

--------


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email