Re: [pcre-dev] [Bug 1295] add 32-bit library

Author: Christian Persch
Date:
To: Tom Bishop, Wenlin Institute
CC: PCRE Development Mailing List
Subject: Re: [pcre-dev] [Bug 1295] add 32-bit library

Hi,

On Mon, 17 Sep 2012 21:25:41 -0400,
"Tom Bishop, Wenlin Institute" <tangmu@???> wrote:
> On Sep 17, 2012, at 5:09 PM, Christian Persch (GNOME)
> <chpe@???> wrote:
>
> > It's absolutely certain that there will never be unicode
> > characters > 10ffff, so there's no forward compatibility problem.
>
> A few years ago it was "absolutely certain" there would never be
> Unicode characters > U+FFFF. As a result, a lot of supposedly
> Unicode-based software still in widespread use fails for characters
> outside the Basic Multilingual Plane. Should we learn from such
> mistakes, or repeat them?

I think it's a pretty safe bet. (Also, as detailed below, this
'masking' in no way precludes us from supporting such a UCS-4 mode
later.)

> > Now you seem to want some sort of "UCS-4" mode that would allow any
> > characters from the 31-bit range (up to 7fffffff) of UCS-4 ? I
> > don't see how that would be useful; for example, which properties
> > would those characters beyond the UTF-32 range have ?
>
> By default, the same properties as for unassigned code points less
> than U+110000. Especially relevant to this discussion, an essential
> property for each character is that it shouldn't be matched with some
> other character without a valid reason.
>
> One application of code points beyond U+10FFFF is for extended
> private use. Properties for all unassigned characters could be
> specified by the same protocols as for ordinary private-use
> characters. It should be possible to specify custom properties for
> each character, including those in the current private-use ranges
> U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD. For
> example, depending on the application, people may want to treat some
> private-use characters as letters, numbers, whitespace, or combining
> marks. (This is an ability PCRE really should have anyway.)

(I admit I only skimmed UTS #18, but I didn't see support for that in
it; it rather seems to say that the Unicode properties the regex should
match on are those defined by Unicode: "an implementation shall support
all of the properties listed below that are in the supported version of
Unicode, with values that match the Unicode definitions for that
version". Of course once you go outside UTF-32 you can make up whatever
values you want :-) )

> > (And if an actual use case for that UCS-4 mode ever arises, we can
> > just add it at that point as a _new_ flag/mode.)
>
> It might be best to design the API and add a few lines of code now,
> while all the authors are alive, and before assumptions about PCRE
> have been hard-coded into applications that depend on it.
>
> Three possible behaviors are under consideration, when a 32-bit
> string contains a code unit > 0x0010FFFF:
>
> (1) trigger an error for invalid UTF-32;
> (2) mask it with 0x001FFFFF; or
> (3) treat it as a character in its own right.

I think we should be clear that pcre32 in UTF mode expects UTF-32. When
PCRE_NO_UTF32_CHECK is not passed, pcre checks that itself. When the
flag is passed, the API user asserts to pcre that it is only passing it
UTF-32 characters. The masking of the bits outside the UTF-32 range is
therefore simply a convenience for the API user: it allows passing
internal buffers directly instead of making a sanitised copy, which
saves an allocation and a copy that are potentially costly, especially
for large data.
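
To make the two code paths concrete, here is a minimal sketch in C. The
function and macro names are made up for illustration and are not taken
from the patch or from PCRE's internals; only the 0x10FFFF limit and the
0x1FFFFF mask come from this thread. The surrogate check is my
assumption about what "valid UTF-32" includes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, for illustration only; not PCRE's internals. */
#define UTF32_MAX_CODE_POINT 0x10FFFFu
#define UTF32_LOW_21_BITS    0x1FFFFFu

/* Check path (PCRE_NO_UTF32_CHECK not passed): return the index of the
 * first code unit that is not valid UTF-32, or len if all are valid.
 * Values above 0x10FFFF and surrogates (0xD800..0xDFFF) are rejected. */
size_t first_invalid_utf32(const uint32_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint32_t c = s[i];
        if (c > UTF32_MAX_CODE_POINT || (c >= 0xD800u && c <= 0xDFFFu))
            return i;
    }
    return len;
}

/* No-check path: read one character straight from the caller's buffer,
 * discarding everything above the low 21 bits. No copy is made. */
uint32_t read_char_masked(const uint32_t *s, size_t i)
{
    return s[i] & UTF32_LOW_21_BITS;
}
```

With a buffer such as { 0x41, 0x80000042, 0x10FFFF }, the check path
would flag index 1, while the masked read of index 1 yields 0x42.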

> I think I understand that (1) will be the default (which is good),
> and that (2) can currently be obtained by turning on the
> PCRE_NO_UTF32_CHECK option. You said that the masking is "only a
> temporary measure while developing this". It's not clear what that
> implies: once the development is complete, would the
> PCRE_NO_UTF32_CHECK option still produce behavior (2), or would the
> masking code be removed and the PCRE_NO_UTF32_CHECK option produce
> behavior (3)?

I'm simply saying that the current patch (on the gitorious branch)
masks *also* in the UTF-32 check code path (i.e. when
PCRE_NO_UTF32_CHECK is NOT passed), and that that bit of code will most
probably be removed for the final patch.

> It seems that there are three possible purposes for someone to
> specify an option named PCRE_NO_UTF32_CHECK:
>
> (A) simply to speed up the code a bit, since they're absolutely
> certain that their strings are valid UTF-32; (B) to obtain behavior
> (2) since they've included extra information in the eleven highest
> bits; or (C) to obtain behavior (3) to support characters beyond
> U+10FFFF.

As I said, I don't think (C) is useful behaviour, *and* this masking
in no way precludes implementing this "UCS-4" mode later on. Clearly,
that would have to be an extra mode besides 'raw' 32-bit character mode
and 'UTF-32' mode (PCRE_UTF32), requiring a new PCRE_UCS4 flag to
pcre32_compile2().

> For purpose (A), suppose on some rare occasion the absolute certainty
> is mistaken; then the best behavior for PCRE is (3), since 0x10000021
> isn't a valid code for an exclamation point (U+0021) and PCRE
> shouldn't report a match when in reality there isn't a match.

Again, in PCRE_UTF32 mode pcre32_exec() expects UTF-32; the masking of
the higher bits is simply a convenience to the API user.
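
For reference, the 0x10000021 case can be worked through with a small
sketch in C. The enum and function below are hypothetical and exist only
to contrast behaviours (2) and (3) from the quoted list; they are not
part of any PCRE API.

```c
#include <stdint.h>

/* Hypothetical policy names mirroring behaviours (2) and (3). */
enum high_bits_policy {
    POLICY_MASK, /* (2): mask the code unit with 0x001FFFFF */
    POLICY_RAW   /* (3): treat the code unit as a character as-is */
};

/* Return 1 if the subject code unit equals the pattern character under
 * the given policy, 0 otherwise. */
int char_matches(uint32_t subject_unit, uint32_t pattern_char,
                 enum high_bits_policy policy)
{
    uint32_t c = (policy == POLICY_MASK)
                     ? (subject_unit & 0x001FFFFFu)
                     : subject_unit;
    return c == pattern_char;
}
```

Under POLICY_MASK, 0x10000021 reduces to 0x21 and compares equal to an
exclamation point (U+0021); under POLICY_RAW it does not.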

> The difference between behaviors (2) and (3) is huge. If only one or
> the other is supported, (3) is more appropriate -- again, PCRE
> shouldn't report a match when in reality there isn't a match. If the
> masking is considered a useful option for the long term and not only
> a temporary measure, then there could be two options in addition to
> the default (strict UTF-32 checking). They might be named:
>
> PCRE_MASK_UTF32_BEYOND_1FFFFF for behavior (2)
>
> and
>
> PCRE_ALLOW_UTF32_BEYOND_10FFFF for behavior (3).
>
> This might only require a few additional lines of code. I'm happy to
> help with the implementation.

I don't think it would be that easy (it's at least one extra check per
character read), *and* again, I think if we want to support that
UCS-4 mode, we should make it a PCRE_UCS4 flag to pcre32_compile2() and
*not* a flag to pcre32_exec().

Regards,
    Christian
