[pcre-dev] [Bug 1416] UTF-8 lookbehinds match bytes instead …

Author: Philip Hazel
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1416] UTF-8 lookbehinds match bytes instead of characters

http://bugs.exim.org/show_bug.cgi?id=1416




--- Comment #1 from Philip Hazel <ph10@???> 2013-11-19 10:35:14 ---
On Tue, 19 Nov 2013, CJ Dennis wrote:

> I will use some PHP code with Sinhala UTF-8 characters (3 bytes each) to
> illustrate the bug.
>
> <?php
> print(preg_replace('/(?<!ක)/u', '*', 'ක') . "\n"); // Works properly
> print(preg_replace('/(?<!ක)/u', '*', 'ම') . "\n"); // Triggers the bug
> ?>


In my mail reader, and in the message that I saved, those two lines are
identical (apart from the comment). There appear to be no UTF-8
characters.

I'm afraid I am not a PHP user. What exactly are those lines supposed to
match? To me it looks like "in the string '*', find a point where the
previous character is not a space, and then insert a space".

> It appears to check every byte position within the UTF-8 encoded character and
> backtrack until a valid starting byte is found, then check it. If you remove
> the improperly inserted characters the resulting bytes do encode the original
> character correctly.


It should treat UTF-8 characters as characters *if PCRE is set into UTF
mode*. I presume that's what /u is trying to do in PHP? At the PCRE
level, it means setting the PCRE_UTF8 option bit.
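
For illustration only, the difference between matching by character and
matching by byte can be sketched with Python's re module. This is not PCRE
and not PHP, and it does not claim to reproduce the reported behaviour; the
function names are hypothetical. It only shows how the same negative
lookbehind behaves when the subject is seen as characters versus as raw
UTF-8 bytes.

```python
import re

KA = "\u0d9a"  # Sinhala letter ka, UTF-8 bytes E0 B6 9A
MA = "\u0db8"  # Sinhala letter ma, UTF-8 bytes E0 B6 B8


def star_by_character(subject: str) -> str:
    # The subject is a sequence of characters, so the lookbehind is
    # tested only at character boundaries.
    return re.sub("(?<!" + KA + ")", "*", subject)


def star_by_byte(subject: str) -> bytes:
    # The subject is a sequence of bytes, so the lookbehind is tested
    # at every byte offset, including offsets inside a character.
    pattern = b"(?<!" + KA.encode("utf-8") + b")"
    return re.sub(pattern, b"*", subject.encode("utf-8"))


if __name__ == "__main__":
    for ch in (KA, MA):
        print(star_by_character(ch))
        print(star_by_byte(ch))
```

In the byte-wise case the asterisks land between the bytes of a single
character, so the result is no longer valid UTF-8, although removing the
asterisks gives back the original bytes.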

> Once it has matched the beginning of the string it should then move ahead by a
> character, not by a byte. It is possible this is a PHP bug and not a PCRE bug
> but I have no way of testing PCRE separately.
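
As a rough sketch of what "move ahead by a character, not by a byte" means
for a UTF-8 buffer: step forward one byte, then skip any continuation bytes
(those of the form 10xxxxxx). The helper name is hypothetical, the input is
assumed to be valid UTF-8, and this is not taken from the PHP or PCRE
sources.

```python
def next_char_offset(data: bytes, pos: int) -> int:
    """Return the byte offset of the character following the one at pos.

    Assumes data is valid UTF-8 and pos is the offset of a lead byte.
    """
    pos += 1
    # Continuation bytes have the top two bits set to 10.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


if __name__ == "__main__":
    subject = "\u0d9a\u0db8".encode("utf-8")  # two 3-byte characters
    offset = 0
    while offset < len(subject):
        print(offset)
        offset = next_char_offset(subject, offset)
```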


If you have a standard PCRE install, you should have the pcretest
program. This should show you where PCRE matches, though it does not
have a replace function. In order for any potential bug to be fixed, I
need to be able to reproduce it using pcretest.

I suspect an issue in the interface between PHP and PCRE because PCRE
has been handling UTF-8 for a long time now, and a problem of this kind
has not previously been reported. I am, however, always prepared to be
proved wrong.

Philip

