Re: [pcre-dev] Processing of invalid UTF-8 characters

Author: ND
Date:  
To: Pcre-dev
Subject: Re: [pcre-dev] Processing of invalid UTF-8 characters
> In the default case, PCRE does not crash: it returns PCRE_ERROR_BADUTF8.
This result is not useful when the main application has to analyze the
input stream no matter what. To do that, the main application is currently
forced to:
  1. have its own built-in UTF-8 parser;
  2. reparse the input stream with this parser to find invalid UTF-8
     characters;
  3. make them valid and remember the changes so that they can be
     restored later;
  4. re-execute pcre_exec() with the now valid UTF-8 stream;
  5. rebuild the output stream, restoring the replaced invalid UTF-8
     characters.
The cost of this work is very high.
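
To make the cost concrete, here is a minimal sketch of the replace, remember
and restore parts of the workaround described above. It is only an
illustration: utf8_patch, utf8_seq_len(), utf8_sanitize() and utf8_restore()
are hypothetical names invented for this example, not part of PCRE, and the
pcre_exec() call that would run between sanitizing and restoring is left out.
It assumes each invalid byte is replaced by a single ASCII placeholder byte,
so that byte offsets in the buffer do not change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one replaced byte (not a PCRE type). */
typedef struct { size_t offset; unsigned char original; } utf8_patch;

/* Length of the well-formed UTF-8 sequence starting at s[0], or 0 if the
   bytes at s do not start one. n is the number of bytes available. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned long cp;

    if (n == 0) return 0;
    if (s[0] < 0x80) return 1;
    if (s[0] >= 0xC2 && s[0] <= 0xDF)      { len = 2; cp = s[0] & 0x1F; }
    else if (s[0] >= 0xE0 && s[0] <= 0xEF) { len = 3; cp = s[0] & 0x0F; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { len = 4; cp = s[0] & 0x07; }
    else return 0;              /* stray continuation byte, 0xC0/0xC1, >0xF4 */
    if (n < len) return 0;      /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (len == 3 && cp < 0x800) return 0;                       /* overlong */
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;  /* range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;                 /* surrogate */
    return len;
}

/* Put the original bytes back. */
static void utf8_restore(unsigned char *buf, const utf8_patch *patches,
                         size_t count)
{
    size_t k;
    for (k = 0; k < count; k++)
        buf[patches[k].offset] = patches[k].original;
}

/* Replace every byte that does not start a well-formed sequence with
   `placeholder` and record it in `patches`. Returns the number of replaced
   bytes, or (size_t)-1 (buffer left unchanged) if `patches` is too small. */
static size_t utf8_sanitize(unsigned char *buf, size_t n,
                            unsigned char placeholder,
                            utf8_patch *patches, size_t max_patches)
{
    size_t i = 0, count = 0;

    assert(buf != NULL || n == 0);
    while (i < n) {
        size_t len = utf8_seq_len(buf + i, n - i);
        if (len == 0) {
            if (count == max_patches) {
                utf8_restore(buf, patches, count);
                return (size_t)-1;
            }
            patches[count].offset = i;
            patches[count].original = buf[i];
            count++;
            buf[i] = placeholder;
            len = 1;
        }
        i += len;
    }
    return count;
}
```

Because every replacement is one byte for one byte, match offsets obtained on
the sanitized buffer still apply to the original bytes after utf8_restore().
The weak point is the placeholder itself: a pattern can match it, so the
application also has to pick a byte that cannot disturb its own patterns.
All of this is extra work on top of the check PCRE already performs.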

Situations in which the analysis must succeed regardless of whether the
input UTF-8 stream is erroneous are widespread. In some cases the errors
are accidental, in others they are deliberate. At present PCRE cannot
offer an efficient solution.

> I think it would penalize the normal running of PCRE too much.

I wrote that this behaviour may be OPTIONAL.

> I also think one could argue about how to interpret a sequence of
> invalid bytes whose values are greater than 127. How many characters does
> such a string encode? For example, suppose the first byte indicates that
> there are three more bytes in a UTF-8 character, two of them are OK, but
> the third one has an invalid value (less than 128, say). Is that a mangled
> UTF-8 character followed by an ASCII byte, or is it four single-byte
> characters?

IMHO a sequence of invalid bytes may be interpreted as one character of
type "invalid" per byte.
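
As a rough sketch of that rule (not how PCRE decodes today): a lenient
decoder that returns a normal code point for a well-formed sequence and
otherwise consumes exactly one byte and reports it as "invalid". The names
CP_INVALID_BYTE, decode_lenient() and count_chars() are made up for this
example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical marker for "one invalid byte"; outside the Unicode range. */
#define CP_INVALID_BYTE 0xFFFFFFFFUL

/* Decode one character from s (n > 0 bytes available). Returns the number
   of bytes consumed. A byte that does not start a well-formed sequence is
   consumed on its own and reported as CP_INVALID_BYTE. */
static size_t decode_lenient(const unsigned char *s, size_t n,
                             unsigned long *cp)
{
    size_t len, i;
    unsigned long c;

    assert(s != NULL && cp != NULL && n > 0);
    c = s[0];
    if (c < 0x80) { *cp = c; return 1; }
    if (c >= 0xC2 && c <= 0xDF)      { len = 2; c &= 0x1F; }
    else if (c >= 0xE0 && c <= 0xEF) { len = 3; c &= 0x0F; }
    else if (c >= 0xF0 && c <= 0xF4) { len = 4; c &= 0x07; }
    else len = 0;                    /* cannot start a sequence */

    if (len != 0 && len <= n) {
        for (i = 1; i < len; i++) {
            if ((s[i] & 0xC0) != 0x80) break;   /* not a continuation byte */
            c = (c << 6) | (s[i] & 0x3F);
        }
        if (i == len &&
            !(len == 3 && c < 0x800) &&                        /* overlong */
            !(len == 4 && (c < 0x10000 || c > 0x10FFFF)) &&    /* range */
            !(c >= 0xD800 && c <= 0xDFFF)) {                   /* surrogate */
            *cp = c;
            return len;
        }
    }
    *cp = CP_INVALID_BYTE;
    return 1;
}

/* Count characters under the "one invalid character per byte" rule. */
static void count_chars(const unsigned char *s, size_t n,
                        size_t *total, size_t *invalid)
{
    size_t i = 0;

    *total = 0;
    *invalid = 0;
    while (i < n) {
        unsigned long cp;
        i += decode_lenient(s + i, n - i, &cp);
        (*total)++;
        if (cp == CP_INVALID_BYTE) (*invalid)++;
    }
}
```

For the example in the quoted question, take the bytes F0 90 80 41: the lead
byte announces three more bytes, two are fine, and the third is an ASCII
byte. Under this rule the answer is four characters: three of type "invalid"
(F0, 90, 80) followed by the ASCII character 0x41. No byte is ever skipped
or merged, so the interpretation is unambiguous.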