[pcre-dev] Support for invalid UTF-8 strings?

Author: Milan Bouchet-Valat
Date:  
To: pcre-dev
Subject: [pcre-dev] Support for invalid UTF-8 strings?
Dear PCRE developers,

I'm writing on behalf of the Julia programming language [1] developers
to ask how invalid UTF-8 strings are handled when the PCRE2_UTF and
PCRE2_NO_UTF_CHECK flags are both set. For context, Julia has taken the
stance that strings are stored in UTF-8 but are not required to contain
valid UTF-8. This is needed to be able to work with any contents, such
as a filename or an invalid text file, without throwing errors or
modifying the input data. This approach is similar to that adopted by
Go (see [2] for an example of problems which arise when strings are
required to be valid Unicode as in Python 3).

Of course, this stance is more complex to hold in the context of
regular expression matching. The PCRE documentation very clearly states
that both the regular expression and the string must be valid Unicode
when PCRE2_UTF is set, and that behavior is undefined if that's not the
case and PCRE2_NO_UTF_CHECK is also set. However, we have been
wondering whether it would be possible to allow the string (not the
regex) to contain invalid UTF-8 when PCRE2_NO_UTF_CHECK is set. In such
a situation, invalid sequences would simply be treated as a series of
one-byte "characters" for which all Unicode predicates would be false,
and returned as-is (see [3]). This is how Julia treats invalid UTF-8
strings and it appears to work well. By default, valid UTF-8 would
still be required, but instead of declaring the behavior as undefined
when the string is invalid and PCRE2_NO_UTF_CHECK is set, a well-
defined behavior would be implemented.
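
To make the proposed semantics concrete, here is a minimal Python sketch
of the idea. It is neither PCRE nor Julia code; the function names
split_units and is_alpha are hypothetical and only illustrate the
behavior described above: each valid UTF-8 sequence forms one character,
and every byte that does not belong to a valid sequence becomes a
one-byte "character" for which Unicode predicates are false.

```python
def split_units(data: bytes):
    """Split bytes into (chunk, is_valid) units.

    Each valid UTF-8 sequence becomes one unit. Any byte that does not
    start a valid sequence becomes a one-byte unit flagged as invalid.
    """
    units = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        # Expected sequence length, judged from the lead byte alone.
        if b < 0x80:
            length = 1
        elif 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            length = 0  # stray continuation byte or invalid lead byte

        chunk = data[i:i + length]
        valid = False
        if length and len(chunk) == length:
            try:
                # Strict decoding rejects overlong forms, surrogates and
                # code points above U+10FFFF.
                chunk.decode("utf-8")
                valid = True
            except UnicodeDecodeError:
                valid = False

        if valid:
            units.append((chunk, True))
            i += length
        else:
            # Fall back to a single-byte unit and resynchronize.
            units.append((data[i:i + 1], False))
            i += 1
    return units


def is_alpha(unit):
    """Example predicate: always false for invalid one-byte units."""
    chunk, valid = unit
    return valid and chunk.decode("utf-8").isalpha()
```

Under these assumptions, the bytes b"a\xffb" split into three units, the
middle one being the single invalid byte 0xFF, which no predicate
matches and which is passed through unchanged.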

Let me stress that we do not suggest supporting invalid regexes, as it
appears difficult to give them a clear and meaningful definition. We
are also aware that we could avoid setting PCRE2_UTF, but the resulting
behavior would not match what is generally expected for strings which
are supposed to contain (possibly invalid) Unicode text.

Do you think such a behavior would make sense? Could it be implemented
without dramatically impacting performance? Julia could use a custom
patch if this feature is not deemed useful for PCRE.

Thanks in advance for your help


1: http://julialang.org
2: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/
3: https://github.com/JuliaLang/julia/pull/26731#issuecomment-379580049