Re: [pcre-dev] [Bug 1295] add 32-bit library

Author: Christian Persch
Date:  
To: Tom Bishop, Wenlin Institute
CC: PCRE Development Mailing List
Subject: Re: [pcre-dev] [Bug 1295] add 32-bit library
Hi;

On Sat, 27 Oct 2012 12:26:51 -0400,
"Tom Bishop, Wenlin Institute" <tangmu@???> wrote:
> I don't know if anyone has had a chance yet to study the change I
> proposed in an earlier message.


I had seen it. It's just that we had that discussion already when I
first proposed this 'high bit masking', and I thought we had agreed to
disagree :-)

First, this 'use high bits for internal purposes' idea is sufficiently
common that it's even mentioned on Wikipedia
[https://en.wikipedia.org/wiki/UTF-32#Use]; it's not something I
invented.

> Also, please note that the Unicode
> conformance requirements that currently appear to be violated include
> these:
>
> "C4 A process shall interpret a coded character sequence according to
> the character semantics established by this standard, if that process
> does interpret that coded character sequence."
>
> "C10 When a process interprets a code unit sequence which purports to
> be in a Unicode character encoding form, it shall treat ill-formed
> code unit sequences as an error condition and shall not interpret
> such sequences as characters."


I don't think this is a conformance issue.

Maybe you're just thinking about this API in the wrong way.
pcre32_exec() with the PCRE_NO_UTF32_CHECK flag doesn't mean I can feed
it invalid UTF-32; it only means that pcre picks out the UTF-32
characters from the input data in a different way. For example, consider
a hypothetical PCRE_INPUT_IS_GZIP flag to pcre_exec() that modifies its
behaviour so that it takes the input data stream as raw gzip data,
internally gunzips it, and uses _that_ as the character data to match
against. Matching would still be performed against the UTF-8 result of
gunzipping the raw data; just as here, matching is still performed
against the UTF-32 that results from masking away the high bits.
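
To make the "different way" concrete, here is a minimal sketch (my
illustration, not actual pcre source) of the transformation the masking
amounts to: each 32-bit data unit is reduced to its low 21 bits before
it reaches the matcher.

  #include <stdint.h>
  #include <stddef.h>

  /* Illustration only: the low 21 bits hold the Unicode code point;
     everything above them is discarded before matching. */
  #define UTF32_CP_MASK 0x001FFFFFu

  static void mask_high_bits(uint32_t *units, size_t n)
  {
    size_t i;
    for (i = 0; i < n; i++)
      units[i] &= UTF32_CP_MASK;   /* e.g. 0x10000042 -> 0x00000042 */
  }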

> The standard also notes, "There are important security issues
> associated with encoding conversion, especially with the conversion
> of malformed text." See also <http://www.unicode.org/reports/tr36/>.


There are no security issues here. The format where the lower 21 bits
are the character and the upper 11 bits are used for whatever purpose
is *not* being proposed for use in data interchange; it is intended
solely for internal use by the programme that calls this pcre32
API and that has pre-validated the lower 21 bits.
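
For concreteness, such a caller-side layout might look something like
this (purely a hypothetical example of application code; pcre defines
none of it):

  #include <stdint.h>

  /* Hypothetical application format: low 21 bits = Unicode code point,
     high 11 bits = application-private tag that pcre is asked to ignore. */
  #define CP_BITS 21
  #define CP_MASK ((1u << CP_BITS) - 1)   /* 0x001FFFFF */

  static uint32_t pack_unit(uint32_t tag, uint32_t cp) { return (tag << CP_BITS) | (cp & CP_MASK); }
  static uint32_t unit_cp(uint32_t u)  { return u & CP_MASK; }
  static uint32_t unit_tag(uint32_t u) { return u >> CP_BITS; }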

> It's fine to have explicit ways to override these conformance
> requirements (I propose a macro PCRE_MASK_UTF32_BEYOND_1FFFFF),


I'm against a compile-time switch like this; as a library user I want
this capability to be always available. A runtime check with a new
flag would be acceptable, but is more work to implement. (Also note
that there is just one bit left in the 32-bit integer for the pcre
flags, so I'd be reluctant to use it for something like this).

Suppose you have a programme using the pcre32 API.

* If the data isn't pre-checked by the programme itself, it simply
must _not_ pass PCRE_NO_UTF32_CHECK; pcre32_exec() will then do that
check, returning an error for anything that's not valid UTF-32
(including the case of those high bits being set!).

* If the data has been pre-checked for UTF-32-ness by the programme,
you can save a bit of runtime by passing PCRE_NO_UTF32_CHECK. This
applies whether the programme leaves the high bits as 0 or stores
internal data there. The bit masking simply lets the programme avoid
making a sanitised copy of the data just to pass it to
pcre32_exec().

In none of these cases is external, untrusted, unchecked, data allowed
inside the matching process itself.
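
Here is a caller-side sketch of the second case (assuming the pcre32
API mirrors pcre_exec()/pcre16_exec(), and assuming the masking
behaviour argued for above; names and signatures are my extrapolation,
since the 32-bit API isn't finalised):

  #include <pcre.h>

  /* buf holds valid UTF-32 in the low 21 bits of each unit and
     application-private tags in the high 11 bits. */
  int match_tagged(const pcre32 *re, const PCRE_UCHAR32 *buf, int len)
  {
    int ovector[30];

    /* The data is pre-validated, so the UTF-32 check is skipped; under
       the behaviour argued for here the high bits are masked away
       internally, so no sanitised copy of buf is needed. */
    return pcre32_exec(re, NULL, buf, len, 0,
                       PCRE_NO_UTF32_CHECK, ovector, 30);
  }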

> but
> PCRE_NO_UTF32_CHECK in itself should only turn off the error
> reporting, not turn off the conformance.


Well, if you want to see an actual conformance issue in pcre, look at
the parallel PCRE_NO_UTF8_CHECK flag. pcreunicode.3 says about it:

"If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
what happens depends on why the string is invalid. If the string
conforms to the "old" definition of UTF-8 (RFC 2279), it is processed
as a string of characters in the range 0 to 0x7FFFFFFF by
pcre_dfa_exec() and the interpreted version of pcre_exec(). In other
words, apart from the initial validity test, these functions (when in
UTF-8 mode) handle strings according to the more liberal rules
of RFC 2279. However, the just-in-time (JIT) optimization for
pcre_exec() supports only RFC 3629. If you are using JIT optimization,
or if the string does not even conform to RFC 2279, the result is
undefined. Your program may crash.

"If you want to process strings of values in the
full range 0 to 0x7FFFFFFF, encoded in a UTF-8-like manner as per the
old RFC, you can set PCRE_NO_UTF8_CHECK to bypass the more
restrictive test. However, in this situation, you will have to apply
your own validity check, and avoid the use of JIT
optimization."

> In other words, if the user
> specifies PCRE_NO_UTF32_CHECK, of course PCRE is freed from the
> responsibility to report an error for ill-formed UTF-32, but it still
> has the responsibility not to report a regex as matching when
> (according to the standard) it doesn't match. For example, the regex
> "A" matches the code for the letter "A" (0x00000042 in UTF-32), but
> not a code such as 0x10000042.


It _is not_ matching the code 0x10000042. The input to pcre32_exec()
may contain that bit pattern, but it is simply transformed (just like
the gzip example above) before it reaches the matching stage.

In conclusion: I don't think we need this as a compile-time flag, nor
as a run-time flag.

Regards,
    Christian