Re: [pcre-dev] Question regarding matching invalid unicode

Author: ph10
Date:
To: Kilian Kilger
CC: pcre-dev
Subject: Re: [pcre-dev] Question regarding matching invalid unicode

On Fri, 14 Feb 2020, Kilian Kilger via Pcre-dev wrote:

> We are trying to use PCRE2 to match UCS-2 text, i.e. UTF-16 without any
> check for "broken" surrogates or any other invalid Unicode. In UCS-2
> encoding every character is 2 bytes and every 2-byte sequence is
> accepted as a valid character.
> Nevertheless, we want Unicode character properties to be considered when
> the code point at the corresponding position is valid UTF-16.
>
> We noticed that we seem to get this behaviour when we set the following:
>
> PCRE2_UCP | PCRE2_NEVER_UTF | PCRE2_NO_UTF_CHECK
>
> but we *do not set*
>
> PCRE2_UTF
>
> Is this correct? Does this have negative implications or undefined behaviour?

It is correct, but you do not need to set PCRE2_NEVER_UTF or
PCRE2_NO_UTF_CHECK, although they do no harm. The former locks out UTF
mode: it makes setting PCRE2_UTF an error and disallows (*UTF) within
the pattern. The latter only has an effect when UTF mode is enabled by
PCRE2_UTF or (*UTF). Here's a test case that demonstrates that this
works:

$ pcre2test -16 zz
PCRE2 version 10.34 2019-11-21
/\p{Arabic}\d./ucp
\x{600}\x{660}\x{D800}
0: \x{600}\x{660}\x{d800}

U+0600 is an Arabic character
U+0660 is an Arabic digit - this won't be recognized as a digit if PCRE2_UCP
         is not set
U+D800 is a surrogate character

You can even check for surrogate character codes:

PCRE2 version 10.34 2019-11-21
/\p{Cs}/ucp
\x{D800}
0: \x{d800}
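
If you only need to know whether a given 16-bit unit is a surrogate, and do not need a regex for it, a plain range test is enough. The sketch below relies on the Unicode surrogate range U+D800 to U+DFFF (high surrogates U+D800 to U+DBFF, low surrogates U+DC00 to U+DFFF); the function names are made up for illustration and are not part of PCRE2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers, not part of PCRE2: classify one 16-bit code unit. */

/* True for any surrogate code unit, U+D800..U+DFFF. */
static bool is_surrogate(uint16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDFFF;
}

/* True for a high (leading) surrogate, U+D800..U+DBFF. */
static bool is_high_surrogate(uint16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDBFF;
}

/* True for a low (trailing) surrogate, U+DC00..U+DFFF. */
static bool is_low_surrogate(uint16_t unit)
{
    return unit >= 0xDC00 && unit <= 0xDFFF;
}
```

For the subject used above, is_surrogate returns true for 0xD800 and false for 0x0600 and 0x0660.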

I hope this helps.

Philip

--
Philip Hazel
