[pcre-dev] [Bug 1416] UTF-8 lookbehinds match bytes instead of characters

Author: Philip Hazel
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 1416] UTF-8 lookbehinds match bytes instead of characters

http://bugs.exim.org/show_bug.cgi?id=1416

--- Comment #4 from Philip Hazel <ph10@???> 2013-11-20 09:23:29 ---
On Tue, 19 Nov 2013, CJ Dennis wrote:

> @Graycode
> The equivalent English regex would be:
> <?php
> // /u is unnecessary but harmless, as there are no UTF-8 characters
> print(preg_replace('/(?<!k)/u', '*', 'k') . "\n");
> print(preg_replace('/(?<!k)/u', '*', 'm') . "\n");
> ?>

Thanks, Graycode, for analyzing this much better than I did. I should
have saved the message in raw form to see what the characters actually
were (my xterm screen on Linux showed just spaces). Now that I've done
that, of course I see the actual bytes.

> I'm assuming therefore that PCRE is behaving correctly (assuming no bug
> has crept in between 7.0 and 8.32).

Thanks for the test. I have now constructed my own test, both with
actual UTF-8 bytes and using escapes, and confirmed that it gives the
correct result with both 8.33 and the forthcoming 8.34. This is the
output from 8.34-RC1:

PCRE version 8.34-RC1 2013-11-19

/(?<!(\x{D9A}))/g8
\x{D9A}
0:
\x{DB8}
0:
0:

/(?<!(ක))/g8
ක
0:
ම
0:
0:
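
For readers without pcretest to hand, here is a minimal sketch of the
same check using Python's standard re module. It is an illustration
only, not part of the original test. The helper name
empty_match_offsets and the variables KA and MA are hypothetical, and
Python's regex engine is not PCRE. Each "0:" line above is one empty
match, so the first subject gives one match (at the start only) and
the second gives two (before and after the character). The final
bytes-level call is included purely for contrast, to show what offsets
inside a character would look like. It makes no claim about what any
PCRE release produced.

```python
import re


def empty_match_offsets(pattern, subject):
    """Return the start offset of every match, similar to pcretest's /g."""
    return [m.start() for m in re.finditer(pattern, subject)]


KA = "\u0d9a"  # the character written \x{D9A} in the pcretest output
MA = "\u0db8"  # the character written \x{DB8} in the pcretest output

# Character-level matching: offsets count whole characters.
char_pattern = "(?<!(" + KA + "))"
print(empty_match_offsets(char_pattern, KA))  # [0]
print(empty_match_offsets(char_pattern, MA))  # [0, 1]

# The ASCII equivalent from the quoted PHP example behaves the same way.
print(empty_match_offsets("(?<!k)", "k"))  # [0]
print(empty_match_offsets("(?<!k)", "m"))  # [0, 1]

# Contrast only: the same pattern applied to raw UTF-8 bytes reports
# offsets that fall inside the three-byte character.
byte_pattern = b"(?<!(" + KA.encode("utf-8") + b"))"
print(empty_match_offsets(byte_pattern, KA.encode("utf-8")))  # [0, 1, 2]
```

The character-level results correspond to the counts of "0:" lines in
the pcretest output above: one for the subject that the lookbehind
forbids, two for the other.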

Incidentally, if you ever need a Windows binary for pcretest again, note
that a PCRE user has the latest releases of pcretest and pcregrep
available for download here:

http://www.rexegg.com/pcregrep-pcretest.html

Philip