[pcre-dev] [Bug 1866] New: UTF-8 class containing \D and \P{…

Author: admin
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1866] New: UTF-8 class containing \D and \P{Nd} matches incorrectly
https://bugs.exim.org/show_bug.cgi?id=1866

            Bug ID: 1866
           Summary: UTF-8 class containing \D and \P{Nd} matches
                    incorrectly
           Product: PCRE
           Version: 8.39
          Hardware: x86
                OS: All
            Status: NEW
          Severity: bug
          Priority: medium
         Component: Code
          Assignee: ph10@???
          Reporter: justin.viiret@???
                CC: pcre-dev@???


We found an issue with this pattern, which has the PCRE_UTF8 flag set but not
PCRE_UCP:

    /[\D\P{Nd}]/8


pcretest -d shows this class being interpreted as the union of "all non-digit
characters up to \xff" and "all characters not in \p{Nd}":

$ ./pcretest -d
PCRE version 8.39 2016-06-14

re> /[\D\P{Nd}]/8

------------------------------------------------------------------
  0  43 Bra
  3     [\x00-/:-\xff\P{Nd}]
 43  43 Ket
 46     End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf
No first char
No need char

data> 0

No match
data> _

0: _
data> \x{1d7cf}

No match

However, my reading of the pcrepattern documentation suggests that without the
PCRE_UCP flag, \D should be interpreted as "all non-digit characters up to \xff
and *all other characters*", meaning that the last test case above, U+1D7CF
(mathematical bold digit one), should match.
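
To make the two readings concrete, here is a minimal Python sketch that models
the class as a predicate on code points. It is not PCRE code: the function
names are made up for illustration, and Python's unicodedata module stands in
for PCRE's property tables, so results could differ between Unicode versions.

```python
import unicodedata


def is_ascii_digit(cp):
    """True for the code points of ASCII '0'..'9'."""
    return 0x30 <= cp <= 0x39


def not_digit_expected(cp):
    # \D without PCRE_UCP, as the documentation is read above:
    # every code point except the ASCII digits, including those above \xff.
    return not is_ascii_digit(cp)


def not_digit_observed(cp):
    # \D as it appears in the pcretest -d dump of the class: [\x00-/:-\xff]
    return cp <= 0xFF and not is_ascii_digit(cp)


def not_nd(cp):
    # \P{Nd}: general category is not "Nd" (decimal number).
    return unicodedata.category(chr(cp)) != "Nd"


def class_expected(cp):
    # [\D\P{Nd}] as the union of the two items under the documented reading.
    return not_digit_expected(cp) or not_nd(cp)


def class_observed(cp):
    # [\D\P{Nd}] as the union shown by pcretest -d.
    return not_digit_observed(cp) or not_nd(cp)


# The three subjects from the transcript: "0", "_" and U+1D7CF.
for cp in (0x30, 0x5F, 0x1D7CF):
    print(f"U+{cp:04X} expected={class_expected(cp)} observed={class_observed(cp)}")
```

Under this model the two readings agree on "0" and "_" and differ only on
U+1D7CF, which is above \xff and has category Nd.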

U+1D7CF does match if we use /\D/8 on its own, or if we transform the pattern
above into an alternation:

re> /\D|\P{Nd}/8
data> 0

No match
data> a

0: a
data> \x{1d7cf}

0: \x{1d7cf}

I have checked both PCRE 8.39 and PCRE 10.22 and they both show the same
behaviour.
