Author: Philip Hazel
Date:
To: ND
CC: Pcre-dev
Subject: Re: [pcre-dev] Processing of invalid UTF-8 characters

On Fri, 11 Feb 2011, ND wrote:
> Hi, Philip!
>
> The chapter "Validity of UTF-8 strings" of PCRE documentation says:
> "if the string does not even conform to RFC 2279, the result is undefined.
> Your program may crash."
... but ONLY if you have specified PCRE_NO_UTF8_CHECK, which is not the
default. That option is intended for the situation when you process the
same string several times and do not need to check it more than once. In
the default case, PCRE does not crash: it returns PCRE_ERROR_BADUTF8.
> IMHO, it's not a practically useful behaviour.
PCRE supports only valid UTF-8. If you tell it a string is valid (by
setting PCRE_NO_UTF8_CHECK) when in fact it is not a valid string, you
cannot expect sensible behaviour.
> There are a lot of UTF-8 Web pages that have some non-UTF-8 characters.
> But no browser crashes when analyzing these pages.
It's a performance issue. PCRE would be slower if it had to take
precautions against broken UTF-8 strings. I chose not to penalize it in
that way. In a recent release the code for "load next character" was
rewritten, following a suggestion by a user, to avoid looping and make
it faster, so this is an issue of real concern.
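As a rough illustration of why a loop-free "load next character" cannot also be defensive, here is a minimal sketch. It is not PCRE's actual code: the function name next_char_unchecked is hypothetical, and it assumes sequences of at most four bytes. It selects the sequence length with a few tests on the first byte and never inspects the continuation bytes.

```c
#include <stddef.h>

/* Hypothetical sketch, not PCRE's real macro: decode one character from
   UTF-8 that is ASSUMED to be well formed, and advance *pp past it.
   There is no loop and no validation of the continuation bytes. */
unsigned long next_char_unchecked(const unsigned char **pp)
{
    const unsigned char *p = *pp;
    unsigned long c = *p++;

    if (c >= 0xC0) {                          /* start of a multi-byte sequence */
        if ((c & 0x20) == 0) {                /* 110xxxxx: one more byte */
            c = ((c & 0x1F) << 6) | (p[0] & 0x3F);
            p += 1;
        } else if ((c & 0x10) == 0) {         /* 1110xxxx: two more bytes */
            c = ((c & 0x0F) << 12) | ((p[0] & 0x3Ful) << 6) | (p[1] & 0x3F);
            p += 2;
        } else {                              /* 11110xxx: three more bytes */
            c = ((c & 0x07) << 18) | ((p[0] & 0x3Ful) << 12)
              | ((p[1] & 0x3Ful) << 6) | (p[2] & 0x3F);
            p += 3;
        }
    }
    *pp = p;
    return c;
}
```

Given the valid bytes E2 82 AC this returns 0x20AC. Given the broken bytes E2 82 41 it returns 0x2081 and silently consumes the ASCII "A" as if it were a continuation byte. Guarding against that would put extra tests on every character of every match.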
> I think it would be useful to teach PCRE to process such erroneous UTF-8
> streams successfully (without a program crash and without stopping on
> PCRE_ERROR_BADUTF8).
It would be better to write a separate function to analyze such bad
strings and to make a guess (and it would be a guess) as to what they
might actually mean, and then generate valid UTF-8 for PCRE to process.
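One possible shape for such a separate function is sketched below. It is only an assumption about what a caller might write and is not part of PCRE: the names utf8_seq_len and utf8_sanitize are hypothetical, the validity rules are those of RFC 3629 (stricter than RFC 2279), and the "guess" is the simplest one available, namely replacing each byte that cannot be decoded with U+FFFD.

```c
#include <stddef.h>

/* Hypothetical helper: return the length (1 to 4) of the well-formed
   UTF-8 sequence starting at s, or 0 if there is none. At most avail
   bytes are examined. Rules assumed here are those of RFC 3629. */
static size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    unsigned char c = s[0];
    unsigned long cp;
    size_t n, i;

    if (c < 0x80) return 1;                                  /* ASCII */
    else if (c >= 0xC2 && c <= 0xDF) { n = 2; cp = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0)     { n = 3; cp = c & 0x0F; }
    else if (c >= 0xF0 && c <= 0xF4) { n = 4; cp = c & 0x07; }
    else return 0;        /* stray continuation byte, C0, C1, or F5..FF */

    if (avail < n) return 0;                                 /* truncated */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;                 /* not 10xxxxxx */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (n == 3 && cp < 0x800) return 0;                      /* overlong */
    if (n == 4 && cp < 0x10000) return 0;                    /* overlong */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;              /* surrogate */
    if (cp > 0x10FFFF) return 0;                             /* out of range */
    return n;
}

/* Hypothetical helper: copy src to dst, replacing every undecodable byte
   with U+FFFD (EF BF BD). dst needs room for 3 * len bytes.
   Returns the number of bytes written. */
size_t utf8_sanitize(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t in = 0, out = 0;

    while (in < len) {
        size_t n = utf8_seq_len(src + in, len - in);
        if (n == 0) {                      /* the guess: one bad byte, one U+FFFD */
            dst[out++] = 0xEF; dst[out++] = 0xBF; dst[out++] = 0xBD;
            in++;
        } else {
            while (n-- > 0) dst[out++] = src[in++];
        }
    }
    return out;
}
```

The output of utf8_sanitize is always valid UTF-8, so it can be passed to PCRE with the default checking left on. Other policies (dropping the bad bytes, or treating them as Latin-1) are equally defensible, which is why this belongs in the caller rather than in the matcher.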
> PCRE could assign to invalid bytes a special character property "invalid".
> These bytes would then be matched successfully by ".", "\p{xx}" etc.
> This behaviour may be optional.
>
> What do you think about that?
I think it would penalize the normal running of PCRE too much. I also
think one could argue about how to interpret a sequence of invalid bytes
whose values are greater than 127. How many characters does such a
string encode? For example, suppose the first byte indicates that there
are three more bytes in a UTF-8 character, two of them are OK, but the
third one has an invalid value (less than 128, say). Is that a mangled
UTF-8 character followed by an ASCII byte, or is it four single-byte
characters?
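The ambiguity can be made concrete with a small sketch. Both counting functions below are hypothetical (as are their names) and neither is something PCRE does. They implement the two readings of the example above and give different answers for the same four bytes.

```c
#include <stddef.h>

/* Number of continuation bytes announced by a lead byte
   (0 for ASCII and for bytes that are not lead bytes). */
static size_t announced(unsigned char c)
{
    if ((c & 0xE0) == 0xC0) return 1;
    if ((c & 0xF0) == 0xE0) return 2;
    if ((c & 0xF8) == 0xF0) return 3;
    return 0;
}

/* Reading 1: a lead byte plus whatever continuation bytes actually follow
   it (up to the announced number) is ONE character, mangled or not. */
size_t count_truncated(const unsigned char *s, size_t len)
{
    size_t i = 0, count = 0;

    while (i < len) {
        size_t want = announced(s[i++]);
        for (; want > 0 && i < len && (s[i] & 0xC0) == 0x80; want--) i++;
        count++;
    }
    return count;
}

/* Reading 2: a complete sequence is one character, but in a broken
   sequence every byte is a separate single-byte character. */
size_t count_bytewise(const unsigned char *s, size_t len)
{
    size_t i = 0, count = 0;

    while (i < len) {
        size_t want = announced(s[i]), k = 1;
        while (k <= want && i + k < len && (s[i + k] & 0xC0) == 0x80) k++;
        i += (k == want + 1) ? k : 1;   /* complete: skip all; broken: one byte */
        count++;
    }
    return count;
}
```

For the bytes F0 9F 98 41 (a lead byte announcing three more, two good continuation bytes, then an ASCII "A"), count_truncated returns 2 and count_bytewise returns 4. A pattern such as "^..$" would match under the first reading and fail under the second.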