Re: [pcre-dev] Processing of invalid UTF-8 characters

Author: Herczeg Zoltán
Date:
To: ND
CC: Pcre-dev
Subject: Re: [pcre-dev] Processing of invalid UTF-8 characters

Hi,

I think the cost of an early check is still lower than doing this during pcre_exec().

Anyway, PCRE has macros in pcre_internal.h:

GETCHAR(c, eptr) GETCHARTEST(c, eptr) GETCHARINC(c, eptr)
GETCHARINCTEST(c, eptr) GETCHARLEN(c, eptr, len)
GETCHARLENTEST(c, eptr, len) BACKCHAR(eptr)

You can redefine them to suit your purposes, including the handling of illegal characters.

Regards,
Zoltan

ND <nadenj@???> wrote:
> > In the default case, PCRE does not crash: it returns PCRE_ERROR_BADUTF8.
>
> This output is not useful when the main application needs to analyze the
> input stream no matter what. To do this, the main application is now forced to:
>   - have its own built-in UTF-8 parser;
>   - reparse the input stream with this built-in parser to find invalid UTF-8
>     characters;
>   - make them valid and remember the changes so that they can be restored
>     later;
>   - re-execute pcre_exec() with the valid UTF-8 stream;
>   - rebuild the output stream, restoring the replaced invalid UTF-8
>     characters.
> And the cost of this work is very high.
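
For illustration, the replace-and-restore steps listed above might look roughly like the sketch below. This is not PCRE code: `utf8_seq_len`, `sanitize_utf8`, `restore_utf8` and `struct patch` are hypothetical names, and using '?' as the replacement byte is an arbitrary assumption.

```c
#include <stddef.h>

/* Hypothetical record of one replaced byte, so it can be restored later. */
struct patch { size_t offset; unsigned char original; };

/* Length (1..4) of the well-formed UTF-8 sequence at p, or 0 if the bytes
   at p do not form one. n is the number of bytes available at p. */
static size_t utf8_seq_len(const unsigned char *p, size_t n)
{
    if (n == 0) return 0;
    unsigned char b = p[0];
    size_t len, i;
    unsigned long cp;

    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF)      { len = 2; cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0)     { len = 3; cp = b & 0x0F; }
    else if (b >= 0xF0 && b <= 0xF4) { len = 4; cp = b & 0x07; }
    else return 0;  /* stray continuation byte, 0xC0/0xC1, or above 0xF4 */

    if (len > n) return 0;  /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (len == 3 && cp < 0x800) return 0;                        /* overlong */
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;   /* out of range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;                  /* surrogate */
    return len;
}

/* Replaces every invalid byte in buf with '?' and records the original in
   patches. Returns the number of patches, or (size_t)-1 if more than
   max_patches would be needed. */
static size_t sanitize_utf8(unsigned char *buf, size_t n,
                            struct patch *patches, size_t max_patches)
{
    size_t i = 0, count = 0;
    while (i < n) {
        size_t len = utf8_seq_len(buf + i, n - i);
        if (len > 0) { i += len; continue; }
        if (count == max_patches) return (size_t)-1;
        patches[count].offset = i;
        patches[count].original = buf[i];
        count++;
        buf[i++] = '?';
    }
    return count;
}

/* Puts the original bytes back after matching is done. */
static void restore_utf8(unsigned char *buf, const struct patch *patches,
                         size_t count)
{
    size_t k;
    for (k = 0; k < count; k++)
        buf[patches[k].offset] = patches[k].original;
}
```

Because every invalid byte is replaced by exactly one byte, byte offsets into the sanitized buffer should also be valid for the original buffer. The sketch still needs a full extra pass over the input before pcre_exec() and another after it, which is the cost being complained about.
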
>
> Situations in which the analysis must complete successfully, whether or not
> the input UTF-8 stream contains errors, are widespread. In some cases the
> errors are accidental; in others they are deliberate. At present PCRE cannot
> offer an efficient solution.
>
> > I think it would penalize the normal running of PCRE too much.
>
> I wrote that this behaviour may be OPTIONAL.
>
> > I also think one could argue about how to interpret a sequence of
> > invalid bytes whose values are greater than 127. How many characters does
> > such a string encode? For example, suppose the first byte indicates that
> > there are three more bytes in a UTF-8 character, two of them are OK, but
> > the third one has an invalid value (less than 128, say). Is that a mangled
> > UTF-8 character followed by an ASCII byte, or is it four single-byte
> > characters?
>
> IMHO a sequence of invalid bytes may be interpreted as one character of
> type "invalid" per byte.
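
To make the "one invalid character per byte" reading concrete, here is a minimal sketch of a lenient decoder of the kind a redefined GETCHARINC-style macro could call. It is only an illustration under stated assumptions: `lenient_next`, `count_chars` and the `INVALID_CHAR` sentinel are hypothetical names, and none of this is taken from pcre_internal.h.

```c
#include <stddef.h>

/* Hypothetical sentinel standing for a character of type "invalid". */
#define INVALID_CHAR 0xFFFFFFFFul

/* Decodes one character at *pp (caller guarantees *pp < end) and advances
   *pp. A well-formed UTF-8 sequence yields its code point. Anything else
   consumes exactly one byte and yields INVALID_CHAR. */
static unsigned long lenient_next(const unsigned char **pp,
                                  const unsigned char *end)
{
    const unsigned char *p = *pp;
    unsigned char b = *p;
    unsigned long cp, min;
    size_t extra, i;

    if (b < 0x80) { *pp = p + 1; return b; }

    if ((b & 0xE0) == 0xC0)      { extra = 1; cp = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
    else goto bad;                        /* stray continuation or 0xF8..0xFF */

    if ((size_t)(end - p) <= extra) goto bad;       /* truncated sequence */
    for (i = 1; i <= extra; i++) {
        if ((p[i] & 0xC0) != 0x80) goto bad;        /* not a continuation byte */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF) goto bad;        /* overlong or out of range */
    if (cp >= 0xD800 && cp <= 0xDFFF) goto bad;     /* surrogate */

    *pp = p + extra + 1;
    return cp;

bad:
    *pp = p + 1;                          /* one byte, one "invalid" character */
    return INVALID_CHAR;
}

/* Counts characters in s[0..n) under this policy. The number of characters
   of type "invalid" is stored in *invalid. */
static size_t count_chars(const unsigned char *s, size_t n, size_t *invalid)
{
    const unsigned char *p = s, *end = s + n;
    size_t total = 0;
    *invalid = 0;
    while (p < end) {
        if (lenient_next(&p, end) == INVALID_CHAR) (*invalid)++;
        total++;
    }
    return total;
}
```

Under this policy the example raised above (a lead byte announcing three more bytes, two good continuation bytes, then a byte below 128, e.g. F0 9F 98 41) counts as four characters: three of type "invalid" followed by the ASCII character 'A'. Other policies, such as treating the mangled prefix as a single invalid character, are equally defensible; this sketch simply picks the per-byte one.
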
>

> --
> ## List details at http://lists.exim.org/mailman/listinfo/pcre-dev