http://bugs.exim.org/show_bug.cgi?id=897
Summary: \w and others based on Unicode properties
Product: PCRE
Version: N/A
Platform: x86
OS/Version: Windows
Status: NEW
Severity: wishlist
Priority: medium
Component: Code
AssignedTo: ph10@???
ReportedBy: pavel@???
CC: pcre-dev@???
A quote from the PCRE documentation:
---
The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
characters of any code value, but the characters that PCRE recognizes as
digits, spaces, or word characters remain the same set as before, all with
values less than 256. This remains true even when PCRE includes Unicode
property support, because to do otherwise would slow down PCRE in many common
cases. If you really want to test for a wider sense of, say, "digit", you must
use Unicode property tests such as \p{Nd}. Note that this also applies to \b,
because it is defined in terms of \w and \W.
---
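To make the distinction in the quoted passage concrete, here is a minimal sketch. It uses Python's standard `re` module, not PCRE, purely as an assumption-laden stand-in: with `re.ASCII`, the escapes `\w` and `\b` are restricted to ASCII, which only approximates the fixed low-value tables the documentation describes, while the default mode for `str` patterns is Unicode-aware. The sample text and variable names are illustrative.

```python
import re

# Sample text written with escapes so the code points are unambiguous:
# "naive" with U+00EF, "cafe" with U+00E9, ASCII digits, Arabic-Indic digits.
text = "na\u00efve caf\u00e9 123 \u0661\u0662\u0663"

# ASCII-only classes: accented letters and non-ASCII digits are not
# word characters, so \b fires in the middle of "naive" and "cafe".
ascii_words = re.findall(r"\b\w+\b", text, flags=re.ASCII)

# Unicode-aware classes: whole words and the Arabic-Indic digits match.
unicode_words = re.findall(r"\b\w+\b", text)

print(ascii_words)
print(unicode_words)
```

The first list splits words at every accented letter, which is the behaviour this report is about; the second is the behaviour being asked for.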
I do appreciate the concern for speed in PCRE.
However, since I have to deal with international characters almost constantly, I
would really appreciate something like a compile-time option (set when building
PCRE itself) that forces it to always use Unicode properties.
I cannot just replace every "\b" with a complex construction based on \p{},
since I don't write the patterns myself; end users do. And parsing their
patterns just to make the right replacements doesn't appeal to me either.
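To illustrate what such a rewrite would involve, here is a rough sketch in Python that only manipulates the pattern text. Everything in it is an assumption for illustration: the function name `widen`, the choice of `[\p{L}\p{N}_]` as the Unicode meaning of "word character", and the lookaround expansion of `\b`. It is not PCRE code and it is deliberately incomplete.

```python
# Assumed Unicode notion of a "word" character (letters, numbers, underscore).
WORD = r"\p{L}\p{N}_"

# Replacements for escapes found outside a character class.
OUTSIDE = {
    "w": "[" + WORD + "]",
    "W": "[^" + WORD + "]",
    "d": r"\p{Nd}",
    "D": r"\P{Nd}",
    # Word boundary: a word character on exactly one side.
    "b": "(?:(?<=[{0}])(?![{0}])|(?<![{0}])(?=[{0}]))".format(WORD),
}

# Inside [...] only some escapes can be expanded; \b means backspace there
# and \W cannot be expressed without nesting, so both are left untouched.
INSIDE = {"w": WORD, "d": r"\p{Nd}", "D": r"\P{Nd}"}


def widen(pattern):
    """Rewrite \\w, \\d, \\b and friends into \\p{...} constructions.

    Not handled (hypothetical sketch only): \\Q...\\E quoting, POSIX
    classes such as [[:alpha:]], and a literal ']' as first class member.
    """
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            table = INSIDE if in_class else OUTSIDE
            # Unknown escapes (including "\\\\") are copied through unchanged.
            out.append(table.get(nxt, ch + nxt))
            i += 2
            continue
        if ch == "[" and not in_class:
            in_class = True
        elif ch == "]" and in_class:
            in_class = False
        out.append(ch)
        i += 1
    return "".join(out)


print(widen(r"\bcaf\w+"))
```

Even this small version has to track escaped backslashes and character classes, and it still mishandles several legal constructs, which is why rewriting end-user patterns is not an attractive route.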
At the very least, I would greatly appreciate a hint on where I should look in
the PCRE sources to try to change this behaviour myself.