[pcre-dev] [Bug 2120] PCRE2_NO_UTF_CHECK does not disable al…

Author: admin
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 2120] PCRE2_NO_UTF_CHECK does not disable all checks
https://bugs.exim.org/show_bug.cgi?id=2120

--- Comment #7 from Philip Hazel <ph10@???> ---
You wrote "Checking the surrogate range seems too benign to be error-worthy."
The problem is that, in UTF-16, it is impossible to encode a character value
(code point) within that range because such code unit values are used in pairs
to encode characters in the range 0x10000 to 0x10ffff (and that's the reason
Unicode code points stop at 0x10ffff - UTF-8 can encode right up to
0x7fffffff).
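
To make the surrogate-pair arithmetic concrete, here is a minimal sketch in C. It is not PCRE2's own code, and the function name utf16_encode is made up for illustration. It shows that code points from 0x10000 to 0x10ffff become a pair of code units taken from the surrogate range, which is why a code point inside that range has no UTF-16 encoding of its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper (hypothetical name, not part of PCRE2).
   Encodes one code point as UTF-16 code units in out[].
   Returns the number of code units written (1 or 2), or 0 if the
   value cannot be encoded in UTF-16 at all. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    /* 0xd800-0xdfff are reserved for the two halves of a pair, so a
       code point in this range cannot be represented. */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;

    /* Two 10-bit halves give 20 bits above 0x10000, hence the limit. */
    if (cp > 0x10FFFF) return 0;

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* one code unit, value unchanged */
        return 1;
    }

    cp -= 0x10000;                      /* now a 20-bit value */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

With this sketch, 0x1234 yields one code unit, 0x10000 yields the pair 0xd800 0xdc00, and 0xd800 itself is rejected.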

PCRE2 compiles patterns into an internal form in which the code points are in
the same format as the pattern. So a UTF-16 pattern compiles escapes such as
\x{1234} into appropriate UTF-16 code unit sequences. It cannot do this for
values in the surrogate range, and so it has no option but to throw an error.
For consistency, the surrogate values are not allowed in UTF-8 or UTF-32
either.

If a subject string has been checked by pcre2_match() for UTF-16 correctness,
the only occurrences of surrogates must be in valid pairs. If the test is
really for characters with code points greater than 0xffff, the check should be
\x{10000}-\x{10ffff} of course.
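
The two ideas in this paragraph can be sketched as a single scan over UTF-16 code units. This is an illustration only, not the check that pcre2_match() actually performs, and the function name count_supplementary is hypothetical. It rejects any surrogate that is not part of a valid high/low pair, and otherwise counts the characters whose code points are greater than 0xffff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper (hypothetical name, not part of PCRE2).
   Scans len UTF-16 code units starting at s.
   Returns -1 if a surrogate appears outside a valid pair, otherwise
   the number of characters with code points above 0xffff. */
long count_supplementary(const uint16_t *s, size_t len)
{
    long count = 0;

    assert(s != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        uint16_t u = s[i];

        if (u >= 0xD800 && u <= 0xDBFF) {
            /* A high surrogate must be followed by a low surrogate. */
            if (i + 1 >= len || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return -1;
            i++;        /* skip the low half of the pair */
            count++;    /* one character in 0x10000-0x10ffff */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            /* A low surrogate with no preceding high surrogate. */
            return -1;
        }
    }
    return count;
}
```

A return value of 0 means the string is well formed and contains nothing above 0xffff, which is the condition the range test above is meant to detect.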

The pattern you quote would work happily in non-UTF 16-bit mode in PCRE2 when
applied to a UTF-16 string, matching the listed code points and any surrogates,
thus ensuring no characters greater than 0xffff when read as UTF-16. Perhaps
that is what was intended? (All the hits I get from your google search seem
quite old, incidentally.)

Let's think of what might be done. I got rid of JAVASCRIPT_COMPAT partly
because it implied *complete* compatibility and partly to avoid tying it to a
specific name (and then having to add ECMASCRIPT_COMPAT as a synonym). Instead,
I added options whose names, I hope, suggest exactly what they do. If an option
to allow surrogate values were implemented, its name would have to be something
like ALLOW_SURROGATES_IN_8_AND_32_BIT_MODES, because, as explained above, it
can't work in UTF-16. The other worry is that there are only 4 free bits in
pcre2_compile() options, and I think this is sufficiently special-case that it
doesn't really deserve one of them. My fault. I should have specified two
options words. However, a "rare options" word could be added to the compile
context. Would this be acceptable?
