https://bugs.exim.org/show_bug.cgi?id=1866
Bug ID: 1866
Summary: UTF-8 class containing \D and \P{Nd} matches incorrectly
Product: PCRE
Version: 8.39
Hardware: x86
OS: All
Status: NEW
Severity: bug
Priority: medium
Component: Code
Assignee: ph10@???
Reporter: justin.viiret@???
CC: pcre-dev@???
We found an issue with this pattern, which has the PCRE_UTF8 flag set but not
PCRE_UCP:
/[\D\P{Nd}]/8
pcretest -d shows this class being interpreted as the union of "all non-digit
characters up to \xff" and "all characters not in \p{Nd}":
$ ./pcretest -d
PCRE version 8.39 2016-06-14
re> /[\D\P{Nd}]/8
------------------------------------------------------------------
0 43 Bra
3 [\x00-/:-\xff\P{Nd}]
43 43 Ket
46 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf
No first char
No need char
data> 0
No match
data> _
0: _
data> \x{1d7cf}
No match
However, my reading of the pcrepattern documentation suggests that without the
PCRE_UCP flag, \D should be interpreted as "all non-digit characters up to \xff
and *all other characters*", meaning that the last test case above, U+1D7CF
(mathematical bold digit one), should match.
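The expected and observed results can be modelled as set membership. The sketch below is only an illustration, not PCRE's implementation: the function names are hypothetical, Python's unicodedata module stands in for PCRE's property tables, and observed_class simply encodes the reading of the pcretest -d dump above, in which the \D part of the class stops at \xff.

```python
import unicodedata


def non_ucp_not_digit(ch: str) -> bool:
    # \D without PCRE_UCP: every character except ASCII 0-9,
    # including all code points above \xff.
    return not ("0" <= ch <= "9")


def not_nd(ch: str) -> bool:
    # \P{Nd}: every character whose general category is not Nd.
    return unicodedata.category(ch) != "Nd"


def expected_class(ch: str) -> bool:
    # [\D\P{Nd}] as the documentation suggests: a plain union.
    return non_ucp_not_digit(ch) or not_nd(ch)


def observed_class(ch: str) -> bool:
    # Assumed model of the compiled class [\x00-/:-\xff\P{Nd}]:
    # the \D part only covers code points up to \xff.
    in_low_range = ord(ch) <= 0xFF and non_ucp_not_digit(ch)
    return in_low_range or not_nd(ch)


if __name__ == "__main__":
    for ch in ("0", "_", "\U0001d7cf"):
        print(
            f"U+{ord(ch):04X}"
            f" expected={expected_class(ch)}"
            f" observed={observed_class(ch)}"
        )
```

Under this model the two functions agree on "0" and "_" and differ only on U+1D7CF, which is category Nd but not an ASCII digit.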
U+1D7CF does match if we use /\D/8 on its own, or if we rewrite the class
above as an alternation:
re> /\D|\P{Nd}/8
data> 0
No match
data> a
0: a
data> \x{1d7cf}
0: \x{1d7cf}
I have checked both PCRE 8.39 and PCRE 10.22 and they both show the same
behaviour.