Re: [pcre-dev] Processing of invalid UTF-8 characters

Author: Jean-Christophe Deschamps
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Processing of invalid UTF-8 characters

>This output is non-useful when the main application needs to analyze
>input stream no matter what. To do this the main application now is
>forced to:
>    have its own built-in UTF8-parser;
>    reparse the input stream by this built-in parser to find invalid
>    UTF-8 characters;
>    make them valid and remember changes to have possibility to
>    restore them later;
>    reexecute pcre_exec() with valid UTF-8 stream;
>    rebuild output stream with restoring of replaced invalid UTF-8
>    characters.
>And cost of this work is very high.


And just why on Earth do you feel it's up to PCRE to carry the burden
of processing nonsensical input strings? If you happen to have to use,
say, SQLite or some other DB engine, will you ask their devs to carry
the very same (useless) burden as well? And the OS as well? And what
else?

That simply doesn't make any sense. Whatever the application is, it's
its responsibility to conform to APIs as they are defined or to place
an intermediate layer to that effect. PCRE is asking for either valid
UTF-8 or random strings of bytes and offers an option to catch invalid
input in both cases. IMHO this is very permissive already. I would
find it natural for a library like PCRE to specify undefined behavior
(crash included) in case of invalid UTF-8 input. Checking UTF-8
conformance and possibly correcting things is just not its business.

It's the same as validation of user input, for example personal data
entered at a website. The data entry module needs to perform
validation once and reject the input until correct data is entered.
It would be plain crazy to accept, for instance, random text as a ZIP
code and leave deeper applications to check and correct the wrong data
on their own every time they need to process a ZIP code.

Write your own random-to-UTF8 code in a way that suits your needs and
let PCRE process pure UTF-8. That sources of invalid UTF-8 are easy to
find is no excuse to rewrite gazillions of libraries with redundant,
useless code in order to cope with it (with doubtful results anyway).
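
For what it's worth, such a layer need not be large. Below is a
minimal sketch in plain C, assuming a byte-for-byte replacement policy
is acceptable: every byte that is not part of a well-formed UTF-8
sequence (per RFC 3629) is overwritten with '?' and logged so it can be
put back later. The names utf8_seq_len, utf8_sanitize, utf8_restore
and utf8_fix are made up for illustration; none of them is part of
PCRE.

```c
#include <assert.h>
#include <stddef.h>

/* One logged change: where it happened and which byte was there. */
typedef struct {
    size_t offset;
    unsigned char original;
} utf8_fix;

/* Length (1..4) of the well-formed UTF-8 sequence starting at s, or 0
 * if the bytes at s do not start one. Overlong forms, surrogates and
 * code points above U+10FFFF are rejected. */
static size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    unsigned char c, lo, hi;

    if (avail == 0) return 0;
    c = s[0];
    if (c < 0x80) return 1;
    if (c >= 0xC2 && c <= 0xDF)
        return (avail >= 2 && (s[1] & 0xC0) == 0x80) ? 2 : 0;
    if (c >= 0xE0 && c <= 0xEF) {
        lo = (c == 0xE0) ? 0xA0 : 0x80;   /* no overlong 3-byte forms */
        hi = (c == 0xED) ? 0x9F : 0xBF;   /* no UTF-16 surrogates */
        if (avail < 3 || s[1] < lo || s[1] > hi) return 0;
        return ((s[2] & 0xC0) == 0x80) ? 3 : 0;
    }
    if (c >= 0xF0 && c <= 0xF4) {
        lo = (c == 0xF0) ? 0x90 : 0x80;   /* no overlong 4-byte forms */
        hi = (c == 0xF4) ? 0x8F : 0xBF;   /* nothing above U+10FFFF */
        if (avail < 4 || s[1] < lo || s[1] > hi) return 0;
        return ((s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) ? 4 : 0;
    }
    return 0;   /* stray continuation byte, 0xC0, 0xC1, 0xF5..0xFF */
}

/* Overwrites each invalid byte with '?' and logs it in fixes[].
 * Returns the number of invalid bytes found. If that exceeds
 * max_fixes, the surplus bytes were left untouched and the caller
 * should retry with a larger log. */
static size_t utf8_sanitize(unsigned char *buf, size_t len,
                            utf8_fix *fixes, size_t max_fixes)
{
    size_t i = 0, n = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        size_t k = utf8_seq_len(buf + i, len - i);
        if (k > 0) {
            i += k;              /* well-formed sequence: skip it */
            continue;
        }
        if (n < max_fixes) {
            fixes[n].offset = i;
            fixes[n].original = buf[i];
            buf[i] = '?';
        }
        n++;
        i++;
    }
    return n;
}

/* Puts the first n logged bytes back where they came from. */
static void utf8_restore(unsigned char *buf, const utf8_fix *fixes,
                         size_t n)
{
    size_t j;

    for (j = 0; j < n; j++)
        buf[fixes[j].offset] = fixes[j].original;
}
```

The idea would be to call utf8_sanitize() on the subject, hand the
buffer to pcre_exec(), then call utf8_restore(). Since the replacement
is byte for byte, the offsets PCRE returns in ovector still apply to
the original data. Once the buffer has been cleaned this way, passing
PCRE_NO_UTF8_CHECK should also spare PCRE its own validity check.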

I wouldn't like to see the valuable time and energy of volunteer PCRE
devs wasted on making PCRE able to process random binary data. If you
need that, I'm sure you can write such a module yourself.