Re: [pcre-dev] [Bug 1416] New: UTF-8 lookbehinds match bytes…

Author: ph10
Date:  
To: 1416
CC: pcre-dev
Subject: Re: [pcre-dev] [Bug 1416] New: UTF-8 lookbehinds match bytes instead of characters
On Tue, 19 Nov 2013, CJ Dennis wrote:

> I will use some PHP code with Sinhala UTF-8 characters (3 bytes each) to
> illustrate the bug.
>
> <?php
> print(preg_replace('/(?<!ක)/u', '*', 'ක') . "\n"); // Works properly
> print(preg_replace('/(?<!ක)/u', '*', 'ම') . "\n"); // Triggers the bug
> ?>


In my mail reader, and in the message that I saved, those two lines are
identical (apart from the comment). There appear to be no UTF-8
characters.

I'm afraid I am not a PHP user. What exactly are those lines supposed to
match? To me it looks like "in the string '*', find a point where the
previous character is not a space, and then insert a space".

> It appears to check every byte position within the UTF-8 encoded character and
> backtrack until a valid starting byte is found, then check it. If you remove
> the improperly inserted characters the resulting bytes do encode the original
> character correctly.


It should treat UTF-8 characters as characters *if PCRE is set into UTF
mode*. I presume that's what /u is trying to do in PHP? At the PCRE
level, it means setting the PCRE_UTF8 option bit.
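
To make the character/byte distinction concrete, here is a minimal sketch
using Python's re module. This is an assumption-laden analogy only: it is
not PCRE, not PHP, and not a reproduction of the reported bug. A str
pattern stands in for UTF mode, and a bytes pattern stands in for matching
without PCRE_UTF8. The code points U+0D9A and U+0DB8 are the two Sinhala
letters from the report; the variable names are made up for illustration.

```python
import re

# The two Sinhala letters from the report, written as escapes so the
# example does not depend on the mail path preserving UTF-8.
subject = "\u0db8"             # U+0DB8, UTF-8 bytes E0 B6 B8
pattern = "(?<!\u0d9a)"        # negative lookbehind for U+0D9A (E0 B6 9A)

# Character-level matching (analogous to UTF mode): the only candidate
# positions are before and after the single character.
char_result = re.sub(pattern, "*", subject)

# Byte-level matching (analogous to non-UTF mode): every byte offset,
# including offsets inside the 3-byte sequence, is a candidate position.
byte_result = re.sub(pattern.encode("utf-8"), b"*", subject.encode("utf-8"))

print(repr(char_result))
print(repr(byte_result))
```

In the character-level case the replacement can only land on character
boundaries. In the byte-level case it can land between the bytes of one
character, and stripping the inserted asterisks gives back the original
byte sequence, which resembles the symptom described in the report.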

> Once it has matched the beginning of the string it should then move ahead by a
> character, not by a byte. It is possible this is a PHP bug and not a PCRE bug
> but I have no way of testing PCRE separately.


If you have a standard PCRE install, you should have the pcretest
program. This should show you where PCRE matches, though it does not
have a replace function. In order for any potential bug to be fixed, I
need to be able to reproduce it using pcretest.

I suspect an issue in the interface between PHP and PCRE because PCRE
has been handling UTF-8 for a long time now, and a problem of this kind
has not previously been reported. I am, however, always prepared to be
proved wrong.

Philip

--
Philip Hazel