[pcre-dev] Support invalid UTF subject strings by PCRE2-JIT

Author: Zoltán Herczeg
To: pcre-dev
Subject: [pcre-dev] Support invalid UTF subject strings by PCRE2-JIT
Dear PCRE2 users,

Now that PCRE2 10.32 has been released, it is time to announce a new major feature for PCRE2-JIT: support for invalid UTF subject strings. This feature is enabled by passing the PCRE2_JIT_INVALID_UTF option to pcre2_jit_compile(). It is recommended to use pcre2_jit_match() once the pattern has been compiled.

Regular expressions (whether they are traditional automata-based engines or newer pattern-matching script languages) are designed to search for character sequences in textual input, where each character has a list of attributes that can be used to control the matching process. For example, patterns can be constructed to search for lowercase Greek words or full Latin sentences.

Currently Unicode is the most popular standard for representing written text. Unicode characters are called code points, and the Unicode standard provides a long list of attributes for each code point. The UTF (Unicode Transformation Format) encodings were created to encode these code points as byte sequences. However, these encodings do not use all possible byte values, so a random binary input may contain bytes which cannot be decoded as code points. When the PCRE2_JIT_INVALID_UTF option is enabled, the generated code can detect these bytes. Since they are not valid code points, nothing matches them, not even a dot with the PCRE2_DOTALL option or a \p{Any}. Zero-width assertions require valid code points as well, e.g. a word boundary check (\b) fails if either side is not a valid UTF character. Therefore the result of a successful match is always a valid UTF string, regardless of the PCRE2_JIT_INVALID_UTF option.
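To make "bytes which cannot be decoded as code points" concrete, here is a minimal sketch of a UTF-8 well-formedness check. It is not the code PCRE2-JIT generates; it covers UTF-8 only, and the function names utf8_sequence_length() and count_undecodable_bytes() are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of PCRE2): returns the length in bytes (1-4)
   of the well-formed UTF-8 sequence starting at p, or 0 if the bytes at p
   do not decode to a Unicode code point. */
size_t utf8_sequence_length(const uint8_t *p, size_t remaining)
{
    uint32_t cp, min;
    size_t len, i;

    if (remaining == 0) return 0;
    if (p[0] < 0x80) return 1;                      /* ASCII */

    if ((p[0] & 0xE0) == 0xC0)      { len = 2; cp = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; cp = p[0] & 0x0F; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; cp = p[0] & 0x07; min = 0x10000; }
    else return 0;          /* stray continuation byte, or 0xF8..0xFF */

    if (remaining < len) return 0;                  /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;        /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(p[i] & 0x3F);
    }

    if (cp < min) return 0;                         /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;     /* UTF-16 surrogate */
    if (cp > 0x10FFFF) return 0;                    /* beyond Unicode range */
    return len;
}

/* Counts the bytes in s[0..n) that are not part of any well-formed
   sequence, resynchronising one byte at a time after a failure. */
size_t count_undecodable_bytes(const uint8_t *s, size_t n)
{
    size_t bad = 0, i = 0;

    while (i < n) {
        size_t len = utf8_sequence_length(s + i, n - i);
        if (len == 0) { bad++; i++; }
        else i += len;
    }
    return bad;
}
```

Under this check, positions where utf8_sequence_length() returns 0 correspond to the places that, per the description above, no pattern item can match when PCRE2_JIT_INVALID_UTF is enabled.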

While enabling the PCRE2_JIT_INVALID_UTF option has a performance overhead, it might still be faster than converting binary data to valid UTF first, especially if a match is found at the beginning of a sizable input. Even when this option is enabled, the UTF code units must still be aligned: a UTF-16/32 subject string must be uint16_t/uint32_t aligned.
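The alignment requirement can be checked before matching when the subject comes from an arbitrary binary buffer. The following is a small sketch; the helper names is_utf16_aligned() and is_utf32_aligned() are invented for this example and are not PCRE2 functions.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical helpers (not part of PCRE2): return nonzero when the
   pointer is suitably aligned to be read as an array of code units. */
int is_utf16_aligned(const void *subject)
{
    return ((uintptr_t)subject % alignof(uint16_t)) == 0;
}

int is_utf32_aligned(const void *subject)
{
    return ((uintptr_t)subject % alignof(uint32_t)) == 0;
}
```

If a check fails, one option is to memcpy() the data into a properly aligned buffer before passing it to the 16-bit or 32-bit library.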

This is a JIT-only feature; there are no plans to support it in the PCRE2 interpreter because of the increased runtime. Furthermore, a large amount of new code has been added, so if you are interested and have some time, please try it. The latest code is available in the svn repository:

svn co svn://vcs.exim.org/pcre2/code/trunk pcre2

Any feedback is welcome.