[pcre-dev] Processing of invalid UTF-8 characters

Author: ND
Date:  
To: Pcre-dev
Subject: [pcre-dev] Processing of invalid UTF-8 characters
Hi, Philip!

The section "Validity of UTF-8 strings" of the PCRE documentation says:
"if the string does not even conform to RFC 2279, the result is undefined.
Your program may crash."
IMHO, this is not practically useful behaviour.
Many UTF-8 Web pages contain some non-UTF-8 bytes,
yet no browser crashes when parsing such pages.
I think it would be useful to teach PCRE to process such erroneous UTF-8
streams successfully (without crashing the program and without stopping
on PCRE_ERROR_BADUTF8).

PCRE could assign invalid bytes a special character property, "invalid".
These bytes would then be matched successfully by ".", "\p{xx}", etc.
This behaviour could be optional.
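
To illustrate the idea, here is a rough sketch in Python. It is not PCRE
code and says nothing about how PCRE would implement this. It relies on
Python's standard "surrogateescape" error handler, which maps each
undecodable byte 0xNN to the lone surrogate U+DCNN, so the byte survives
decoding and can be told apart from valid characters. The names
decode_lenient, is_invalid and INVALID are made up for this example.

```python
import re

def decode_lenient(data: bytes) -> str:
    """Decode UTF-8 without failing: each invalid byte 0xNN becomes U+DCNN."""
    return data.decode("utf-8", errors="surrogateescape")

def is_invalid(ch: str) -> bool:
    """True if ch stands for a byte that was not valid UTF-8 in the input."""
    return 0xDC80 <= ord(ch) <= 0xDCFF

# Hypothetical stand-in for an "invalid" character property:
# a character class covering the escaped-byte range.
INVALID = re.compile("[\udc80-\udcff]")

# 0xE9 followed by a space is not a valid UTF-8 sequence.
text = decode_lenient(b"caf\xe9 ok")

# "." matches the escaped byte like any other character.
dot_match = re.fullmatch(r"caf. ok", text)

# The "invalid" class finds exactly the offending position.
bad_positions = [m.start() for m in INVALID.finditer(text)]

# The original bytes can be restored unchanged.
restored = text.encode("utf-8", errors="surrogateescape")
```

With something like this, matching never aborts on bad input, and the
caller can still locate the invalid bytes if it cares about them.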

What do you think about that?

Thanx.