On Tue, 19 Nov 2013, CJ Dennis wrote:
> I will use some PHP code with Sinhala UTF-8 characters (3 bytes each) to
> illustrate the bug.
>
> <?php
> print(preg_replace('/(?<!ක)/u', '*', 'ක') . "\n"); // Works properly
> print(preg_replace('/(?<!ක)/u', '*', 'ම') . "\n"); // Triggers the bug
> ?>

In my mail reader, and in the message that I saved, those two lines are
identical (apart from the comment). There appear to be no UTF-8
characters.

I'm afraid I am not a PHP user. What exactly are those lines supposed to
match? To me it looks like "in the string '*', find a point where the
previous character is not a space, and then insert a space".

> It appears to check every byte position within the UTF-8 encoded character and
> backtrack until a valid starting byte is found, then check it. If you remove
> the improperly inserted characters the resulting bytes do encode the original
> character correctly.

It should treat UTF-8 characters as characters *if PCRE is set into UTF
mode*. I presume that's what /u is trying to do in PHP? At the PCRE
level, it means setting the PCRE_UTF8 option bit.
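
To make the distinction concrete, here is a minimal sketch of the difference between scanning by character and scanning by byte. It uses Python's re module purely as a stand-in; it is not PCRE and not PHP, so it shows the general effect rather than reproducing the reported behaviour. The helper names char_mode and byte_mode are hypothetical, and the code points U+0D9A and U+0DB8 are assumed to be the two Sinhala letters in the quoted example.

```python
import re

# Python's `re` is only a stand-in here; it is NOT PCRE or PHP's preg_* layer.
KA = "\u0d9a"  # Sinhala letter, UTF-8 bytes E0 B6 9A (assumed to be the first test character)
MA = "\u0db8"  # Sinhala letter, UTF-8 bytes E0 B6 B8 (assumed to be the second test character)


def char_mode(subject: str) -> str:
    """Insert '*' at every position not preceded by KA, scanning by character."""
    return re.sub(f"(?<!{KA})", "*", subject)


def byte_mode(subject: str) -> bytes:
    """Same pattern, but pattern and subject are raw UTF-8 bytes (no UTF mode)."""
    pattern = b"(?<!" + re.escape(KA.encode("utf-8")) + b")"
    return re.sub(pattern, b"*", subject.encode("utf-8"))


if __name__ == "__main__":
    for s in (KA, MA):
        # Character mode only ever matches on character boundaries;
        # byte mode also matches between the bytes of one character.
        print(ascii(char_mode(s)), byte_mode(s))
```

In character mode the second subject gets a '*' before and after the character and nowhere else. In byte mode a '*' also lands between the three bytes, and stripping those insertions gives back the original UTF-8 sequence, which is the shape of the symptom described in the report.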
> Once it has matched the beginning of the string it should then move ahead by a
> character, not by a byte. It is possible this is a PHP bug and not a PCRE bug
> but I have no way of testing PCRE separately.

If you have a standard PCRE install, you should have the pcretest
program. This should show you where PCRE matches, though it does not
have a replace function. In order for any potential bug to be fixed, I
need to be able to reproduce it using pcretest.

I suspect an issue in the interface between PHP and PCRE because PCRE
has been handling UTF-8 for a long time now, and a problem of this kind
has not previously been reported. I am, however, always prepared to be
proved wrong.