https://bugs.exim.org/show_bug.cgi?id=1718
Bug ID: 1718
Summary: Code and documentation differ on definition of POSIX [:punct:] class
Product: PCRE
Version: 8.37
Hardware: x86
OS: Windows
Status: NEW
Severity: bug
Priority: medium
Component: Code
Assignee: ph10@???
Reporter: justin.viiret@???
CC: pcre-dev@???
The "pcrepattern" man page states that for UCP mode, PCRE interprets [:punct:]
as follows:
[:punct:] This matches all characters that have the Unicode P (punctuation)
property, plus those characters whose code points are less than 128
that have the S (Symbol) property.
However, in the code (pcre_xclass.c, from line 243) it appears that the test is
slightly different:
    /* Punctuation: all Unicode punctuation, plus ASCII characters that
    Unicode treats as symbols rather than punctuation, for Perl
    compatibility (these are $+<=>^`|~). */
    case PT_PXPUNCT:
    if ((PRIV(ucp_gentype)[prop->chartype] == ucp_P ||
          (c < 256 && PRIV(ucp_gentype)[prop->chartype] == ucp_S)) == isprop)
      return !negated;
    break;
I think the test above should check if c < 128 for code points in S as per the
docs, rather than 256.
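The effect of the two limits can be illustrated with a standalone sketch. It does not use PCRE's property tables: the enum gen_cat, the lookup toy_category, and the function xpunct_match are hypothetical names made up for this illustration, and the category table covers only a handful of characters.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for PCRE's general-category lookup.
   Only a few characters are classified; everything else is CAT_OTHER. */
typedef enum { CAT_OTHER, CAT_P, CAT_S } gen_cat;

static gen_cat toy_category(uint32_t c)
{
    switch (c) {
    case '!': case ',': case '.':              /* ASCII punctuation (Po) */
        return CAT_P;
    case '$': case '+': case '<': case '=': case '>':
    case '^': case '`': case '|': case '~':    /* ASCII symbols (S) */
        return CAT_S;
    case 0x00B4:                               /* ACUTE ACCENT (Sk) */
        return CAT_S;
    case 0x00BF:                               /* INVERTED QUESTION MARK (Po) */
        return CAT_P;
    default:
        return CAT_OTHER;
    }
}

/* Same shape as the PT_PXPUNCT test: any P character matches, and an S
   character matches only if its code point is below symbol_limit. */
static bool xpunct_match(uint32_t c, uint32_t symbol_limit)
{
    gen_cat cat = toy_category(c);
    return cat == CAT_P || (c < symbol_limit && cat == CAT_S);
}
```

Under these assumptions, xpunct_match(0x00B4, 256) is true, which mirrors the behaviour reported here, while xpunct_match(0x00B4, 128) is false, as the documentation describes. The ASCII symbols such as '$' match under either limit.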
Here is a test case. The character U+00B4 ("acute accent") is in category Sk
(Symbol, modifier) and should not match against /[[:punct:]]/8W, but it does.
--------
$ bin/pcretest
PCRE version 8.37 2015-04-28
re> /[[:punct:]]/8W
data> \xc2\xb4
0: \x{b4}
--------
Perl 5.20.2 does not produce a match for this case.
I checked with PCRE2, and it interprets this pattern the same way.
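For reference, the data line \xc2\xb4 in the transcript is the two-byte UTF-8 encoding of U+00B4, which is why pcretest prints the match as \x{b4}. A minimal sketch of that decoding follows; it handles two-byte sequences only, and decode_utf8_2byte is a hypothetical helper, not a PCRE function.

```c
#include <stdint.h>

/* Decode a two-byte UTF-8 sequence of the form 110xxxxx 10xxxxxx.
   Returns the code point, or -1 if the bytes are not a valid
   two-byte sequence. */
static int32_t decode_utf8_2byte(uint8_t b0, uint8_t b1)
{
    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
        return -1;
    int32_t cp = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
    return cp < 0x80 ? -1 : cp;   /* reject overlong encodings */
}
```

Here decode_utf8_2byte(0xC2, 0xB4) yields 0xB4, a code point that lies between the documented limit of 128 and the 256 used in the code.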