Re: [pcre-dev] Processing of invalid UTF-8 characters

Author: Philip Hazel
Date:  
To: ND, Herczeg Zoltán, Jean-Christophe Deschamps
CC: Pcre-dev
Subject: Re: [pcre-dev] Processing of invalid UTF-8 characters
On Fri, 11 Feb 2011, ND wrote:

> reparse the input stream by this built-in parser to find invalid UTF-8
> characters;


I am still considering whether PCRE could return the offset of the bad
character for PCRE_ERROR_BADUTF8, but even if it did, it would return
only the first such character (because it stops as soon as it finds
one).
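
For illustration, here is a minimal sketch of a scan that reports the byte offset of the first invalid UTF-8 sequence and stops there. The function name first_bad_utf8 is hypothetical and is not a PCRE API. The validity rules follow RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF), which may differ from what a given PCRE release accepts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: returns the byte offset of the
   first invalid UTF-8 sequence in s[0..len), or -1 if the buffer is valid.
   Rules are those of RFC 3629 (an assumption, not PCRE's exact check). */
long first_bad_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i];
        size_t need;             /* continuation bytes expected */
        unsigned long cp, min;   /* decoded value, smallest legal value */

        if (c < 0x80) { i++; continue; }          /* plain ASCII */
        else if (c >= 0xC2 && c <= 0xDF) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if (c >= 0xE0 && c <= 0xEF) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if (c >= 0xF0 && c <= 0xF4) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return (long)i;     /* stray continuation, 0xC0, 0xC1, 0xF5..0xFF */

        if (len - i <= need) return (long)i;      /* sequence is truncated */
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return (long)i;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* Reject overlong encodings, surrogates, and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (long)i;
        i += need + 1;
    }
    return -1;
}
```

Because the scan returns at the first failure, a caller that wants every bad offset would have to resume the scan itself after each reported position.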

> > I think it would penalize the normal running of PCRE too much.
> I wrote that this behaviour may be OPTIONAL.


If it is optional at runtime, then the option has to be tested *for
every character loaded*, which is too much. This could be avoided by
making it optional at build time (by changing the macros, as suggested by
Zoltán below).
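
The difference between the two approaches can be sketched as follows. This is only an illustration under stated assumptions: OPT_CHECK, CHECKED_BUILD, CHECK_BYTE, is_invalid_byte and both counting functions are hypothetical names, not PCRE options or macros, and the byte test is deliberately simplified (it only flags bytes that can never occur in UTF-8).

```c
#include <assert.h>
#include <stddef.h>

#define OPT_CHECK 0x01   /* hypothetical run-time option bit */

/* Simplified test: bytes that can never appear in UTF-8 text. */
static int is_invalid_byte(unsigned char b)
{
    return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}

/* Run-time variant: the option is re-tested for every byte loaded,
   even when the caller never asked for checking. */
size_t count_chars_runtime(const unsigned char *p, size_t len,
                           int options, int *bad)
{
    size_t n = 0;

    assert(p != NULL && bad != NULL);
    *bad = 0;
    for (size_t i = 0; i < len; i++) {
        if ((options & OPT_CHECK) != 0 && is_invalid_byte(p[i])) *bad = 1;
        if ((p[i] & 0xC0) != 0x80) n++;   /* count non-continuation bytes */
    }
    return n;
}

/* Build-time variant: the check is compiled in or out by a macro, so an
   unchecked build carries no per-byte test at all. */
#ifdef CHECKED_BUILD
#define CHECK_BYTE(b, bad) do { if (is_invalid_byte(b)) *(bad) = 1; } while (0)
#else
#define CHECK_BYTE(b, bad) ((void)0)
#endif

size_t count_chars_buildtime(const unsigned char *p, size_t len, int *bad)
{
    size_t n = 0;

    assert(p != NULL && bad != NULL);
    *bad = 0;
    for (size_t i = 0; i < len; i++) {
        CHECK_BYTE(p[i], bad);            /* expands to nothing by default */
        if ((p[i] & 0xC0) != 0x80) n++;
    }
    return n;
}
```

Compiling with -DCHECKED_BUILD enables the check in the second function; without it, the inner loop contains no option test.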

On Fri, 11 Feb 2011, Herczeg Zoltán wrote:

> Anyway, PCRE has macros in pcre_internal.h:
>
> GETCHAR(c, eptr) GETCHARTEST(c, eptr) GETCHARINC(c, eptr)
> GETCHARINCTEST(c, eptr) GETCHARLEN(c, eptr, len)
> GETCHARLENTEST(c, eptr, len) BACKCHAR(eptr)
>
> You can redefine them to fit your purposes, including handling illegal characters.


Yes, indeed.

On Fri, 11 Feb 2011, Jean-Christophe Deschamps wrote:

> I wouldn't like to see valuable time and energy of benevolent PCRE devs
> wasted to have PCRE able to process random binary data. If you need
> that, I'm sure you can write such a module yourself.


A general "check and clean this data" application could be useful in
many situations, and that is another reason why it should not be part of
PCRE. Thanks for adding weight to my arguments!
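
As a rough sketch of what such a standalone "check and clean" step might look like, the following overwrites every byte that does not begin a well-formed UTF-8 sequence with '?' before the data is handed to a matcher. The names utf8_seq_len and clean_utf8 are hypothetical, the replacement character is an arbitrary choice, and the validity rules are those of RFC 3629 rather than anything taken from PCRE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length of the well-formed UTF-8 sequence starting
   at s[0] (avail >= 1 bytes readable), or 0 if there is none. */
static size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    unsigned char c = s[0];
    size_t need;             /* continuation bytes expected */
    unsigned long cp, min;   /* decoded value, smallest legal value */

    if (c < 0x80) return 1;
    if (c >= 0xC2 && c <= 0xDF)      { need = 1; cp = c & 0x1F; min = 0x80; }
    else if (c >= 0xE0 && c <= 0xEF) { need = 2; cp = c & 0x0F; min = 0x800; }
    else if (c >= 0xF0 && c <= 0xF4) { need = 3; cp = c & 0x07; min = 0x10000; }
    else return 0;           /* byte cannot start a sequence */

    if (avail <= need) return 0;                 /* truncated */
    for (size_t k = 1; k <= need; k++) {
        if ((s[k] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (s[k] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return need + 1;
}

/* Replaces each offending byte with '?' in place and returns how many
   bytes were replaced. The buffer length never changes. */
size_t clean_utf8(unsigned char *buf, size_t len)
{
    size_t i = 0, replaced = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        size_t n = utf8_seq_len(buf + i, len - i);
        if (n == 0) { buf[i++] = '?'; replaced++; }   /* bad byte: patch it */
        else i += n;                                  /* good sequence: skip */
    }
    return replaced;
}
```

Replacing byte for byte keeps all offsets stable, so match positions reported on the cleaned buffer still line up with the original input.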

Philip

--
Philip Hazel