[pcre-dev] Processing invalid UTF strings

Author: ph10
To: pcre-dev
Subject: [pcre-dev] Processing invalid UTF strings
A while ago Zoltan added support for searching in invalid UTF strings to
the JIT compiler. Recently I took a look at the pcre2_match()
interpreter and realized that it could do the same without too much
alteration and without affecting normal matching. I have just committed
the patch that does this. There's a new pcre2_compile() option called
PCRE2_MATCH_INVALID_UTF that enables the new support. It is all
documented, but, executive summary: the subject can contain invalid UTF
sequences, but they can never form part of any match. All returned
matched strings are valid UTF. (See pcre2unicode for details.)

The tests have been updated but as this is new logic, it is quite likely
that there are edge cases that have been missed. If anybody feels keen,
please try this new feature out. Also, I may have missed parts of the
documentation that need updating. Thanks!


Philip Hazel