https://bugs.exim.org/show_bug.cgi?id=1719
Bug ID: 1719
Summary: Class containing negated POSIX classes with other
classes match incorrectly
Product: PCRE
Version: 8.37
Hardware: x86
OS: Linux
Status: NEW
Severity: bug
Priority: medium
Component: Code
Assignee: ph10@???
Reporter: justin.viiret@???
CC: pcre-dev@???
I have another bug related to POSIX character classes; this one is a little
more involved than the ones I filed yesterday. These were all found with
fuzzer-driven comparison testing against Intel's Hyperscan regex engine.
We have found that PCRE differs from Perl's behaviour in PCRE_UCP mode for some
character classes composed of more than one class (POSIX class, mnemonic like
\d, etc.) where one of those classes is a negated POSIX class that does not
become a Unicode property in UCP mode, for example [:^ascii:] or [:^xdigit:].
As a concrete example, it appears that /[[:^ascii:]]/8W behaves effectively as
/[\x{80}-\x{10ffff}]/8, whereas /[[:^ascii:]\w]/8W fails to match some
non-ASCII, non-word code points. Adding \w should only widen the class, yet the
latter appears to match a smaller set of characters than the former -- perhaps
this is some interaction with the negation at pcre_compile time?
Here is a test case in pcretest format:
/[[:^ascii:]\w]/8W
a
9
g
\x{100}
\x{200}
\x{300}
\x{37e}
This fails to match the last two cases (\x{300} and \x{37e}) in pcretest, but
returns a match for all test cases in perltest.pl (with 'use utf8; require
Encode;' uncommented). I see similar results for /[[:^xdigit:][:space:]]/8W
with the same test input data.
I have checked with PCRE2 10.20, and it exhibits the same behaviour.
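
For reference, here is a minimal sketch of the result I would expect if the
class is treated as a plain union of its members. It uses Python's re module
rather than PCRE or Perl, so it is only an independent cross-check, not what
either engine does internally. Python has no POSIX bracket classes, so the
patterns below are hand-written stand-ins and are assumptions of this sketch:
[:^ascii:] is modelled as \x80-\U0010ffff, [:^xdigit:] as "anything except
0-9, A-F, a-f", and [:space:] as \s. The names SUBJECTS, NON_ASCII_OR_WORD,
NON_XDIGIT_OR_SPACE and matches are hypothetical.

```python
import re

# Subjects from the pcretest case above.
SUBJECTS = ["a", "9", "g", "\u0100", "\u0200", "\u0300", "\u037e"]

# Stand-in for /[[:^ascii:]\w]/8W: any non-ASCII code point, or a word character.
NON_ASCII_OR_WORD = re.compile(r"[\x80-\U0010ffff\w]")

# Stand-in for /[[:^xdigit:][:space:]]/8W: any non-hex-digit, or whitespace.
NON_XDIGIT_OR_SPACE = re.compile(r"(?:[^0-9A-Fa-f]|\s)")


def matches(pattern, subjects):
    """Map each subject to True/False depending on whether the pattern matches it."""
    return {s: pattern.search(s) is not None for s in subjects}


if __name__ == "__main__":
    for name, pattern in [
        ("non-ascii or word", NON_ASCII_OR_WORD),
        ("non-xdigit or space", NON_XDIGIT_OR_SPACE),
    ]:
        for subject, matched in matches(pattern, SUBJECTS).items():
            # Print the code point rather than the raw character for readability.
            print(f"{name}: U+{ord(subject):04X} -> {matched}")
```

Under these union semantics, every subject matches the first pattern: a, 9 and
g are word characters, and the remaining four are non-ASCII. In particular,
\x{300} and \x{37e} match both patterns here, which is where pcretest
disagrees.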