Re: [pcre-dev] [Bug 1295] add 32-bit library

Author: Tom Bishop, Wenlin Institute
Date:  
To: pcre-dev@exim.org
Subject: Re: [pcre-dev] [Bug 1295] add 32-bit library

On Sep 17, 2012, at 5:09 PM, Christian Persch (GNOME) <chpe@???> wrote:

> It's absolutely certain that there will never be unicode characters > 10ffff,
> so there's no forward compatibility problem.


A few years ago it was "absolutely certain" there would never be Unicode characters > U+FFFF. As a result, a lot of supposedly Unicode-based software still in widespread use fails for characters outside the Basic Multilingual Plane. Should we learn from such mistakes, or repeat them?

> Now you seem to want some sort of "UCS-4" mode that would allow any characters
> from the 31-bit range (up to 7fffffff) of UCS-4 ? I don't see how that would be
> useful; for example, which properties would those characters beyond the UTF-32
> range have ?


By default, the same properties as for unassigned code points less than U+110000. Especially relevant to this discussion, an essential property for each character is that it shouldn't be matched with some other character without a valid reason.

One application of code points beyond U+10FFFF is for extended private use. Properties for all unassigned characters could be specified by the same protocols as for ordinary private-use characters. It should be possible to specify custom properties for each character, including those in the current private-use ranges U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD. For example, depending on the application, people may want to treat some private-use characters as letters, numbers, whitespace, or combining marks. (This is an ability PCRE really should have anyway.)
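To make the idea concrete, here is a rough sketch in C of how caller-supplied properties might be looked up. Every name in it (category, override_range, lookup_category, CAT_DEFER) is made up for illustration and is not part of PCRE; it only assumes the three private-use ranges listed above and the "unassigned by default" rule for values beyond U+10FFFF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical categories for this sketch; not PCRE's property codes. */
typedef enum {
    CAT_DEFER,        /* not handled here: consult the normal Unicode tables */
    CAT_UNASSIGNED,
    CAT_PRIVATE_USE,
    CAT_LETTER,
    CAT_NUMBER,
    CAT_SPACE,
    CAT_MARK
} category;

/* One caller-supplied override: every code point in [first, last] gets cat. */
typedef struct {
    uint32_t first;
    uint32_t last;
    category cat;
} override_range;

/* True for the three private-use ranges named in the text. */
static bool is_private_use(uint32_t c)
{
    return (c >= 0xE000u   && c <= 0xF8FFu)
        || (c >= 0xF0000u  && c <= 0xFFFFDu)
        || (c >= 0x100000u && c <= 0x10FFFDu);
}

/* Overrides win; otherwise private-use stays private-use, anything beyond
   U+10FFFF is treated as unassigned, and everything else is deferred. */
static category lookup_category(uint32_t c, const override_range *table, size_t n)
{
    assert(table != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (c >= table[i].first && c <= table[i].last)
            return table[i].cat;
    }
    if (is_private_use(c))
        return CAT_PRIVATE_USE;
    if (c > 0x10FFFFu)
        return CAT_UNASSIGNED;
    return CAT_DEFER;
}
```

An application could, for instance, pass a table that marks U+E000..U+E0FF as letters and a block above U+10FFFF as whitespace, and leave everything else to the defaults.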

> (And if an actual use case for that UCS-4 mode ever arises, we can
> just add it at that point as a _new_ flag/mode.)


It might be best to design the API and add a few lines of code now, while all the authors are alive, and before assumptions about PCRE have been hard-coded into applications that depend on it.

Three possible behaviors are under consideration for a 32-bit string that contains a code unit > 0x0010FFFF:

(1) trigger an error for invalid UTF-32;
(2) mask it with 0x001FFFFF; or
(3) treat it as a character in its own right.
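
The three behaviors can be sketched in a few lines of C. This is only an illustration under my own assumptions: the names beyond_policy and resolve_code_unit are hypothetical, not existing PCRE identifiers, and the sketch looks only at the > 0x0010FFFF case (a full UTF-32 validity check would also have to consider surrogates).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy names for this sketch; they are not PCRE options. */
typedef enum {
    POLICY_STRICT,  /* (1) reject code units above 0x0010FFFF */
    POLICY_MASK,    /* (2) keep only the low 21 bits */
    POLICY_ALLOW    /* (3) treat the value as a character in its own right */
} beyond_policy;

/* Stores the value to match against in *out and returns true, or returns
   false when the code unit is rejected as invalid UTF-32. */
static bool resolve_code_unit(uint32_t unit, beyond_policy policy, uint32_t *out)
{
    assert(out != NULL);
    if (unit <= 0x0010FFFFu) {
        *out = unit;            /* in range: all three policies agree */
        return true;
    }
    switch (policy) {
    case POLICY_STRICT:
        return false;
    case POLICY_MASK:
        *out = unit & 0x001FFFFFu;
        return true;
    case POLICY_ALLOW:
        *out = unit;
        return true;
    }
    return false;               /* unknown policy: be conservative */
}
```

The only place the policies differ is the branch for out-of-range units, which is why supporting all three should not cost much code.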

I think I understand that (1) will be the default (which is good), and that (2) can currently be obtained by turning on the PCRE_NO_UTF32_CHECK option. You said that the masking is "only a temporary measure while developing this". It's not clear what that implies: once the development is complete, would the PCRE_NO_UTF32_CHECK option still produce behavior (2), or would the masking code be removed and the PCRE_NO_UTF32_CHECK option produce behavior (3)?

It seems that there are three possible purposes for someone to specify an option named PCRE_NO_UTF32_CHECK:

(A) simply to speed up the code a bit, since they're absolutely certain that their strings are valid UTF-32;
(B) to obtain behavior (2) since they've included extra information in the eleven highest bits; or
(C) to obtain behavior (3) to support characters beyond U+10FFFF.

For purpose (A), suppose on some rare occasion the absolute certainty is mistaken; then the best behavior for PCRE is (3), since 0x10000021 isn't a valid code for an exclamation point (U+0021) and PCRE shouldn't report a match when in reality there isn't a match.
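
The arithmetic behind that example is easy to check. The following is a small sketch, with helper names (equal_after_masking, equal_as_is) that I have invented for the purpose; it is not how PCRE compares characters internally, only a demonstration of the collision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOW_21_BITS 0x001FFFFFu

/* Compile-time check: masking maps 0x10000021 onto U+0021. */
static_assert((0x10000021u & LOW_21_BITS) == 0x21u,
              "0x10000021 collapses to U+0021 under the 21-bit mask");

/* Behavior (2): the top eleven bits of the subject unit are discarded. */
static bool equal_after_masking(uint32_t subject_unit, uint32_t pattern_char)
{
    return (subject_unit & LOW_21_BITS) == pattern_char;
}

/* Behavior (3): the subject unit is compared as it stands. */
static bool equal_as_is(uint32_t subject_unit, uint32_t pattern_char)
{
    return subject_unit == pattern_char;
}
```

With masking, a pattern containing "!" matches the stray unit 0x10000021; without it, the comparison correctly fails.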

The difference between behaviors (2) and (3) is huge. If only one or the other is supported, (3) is more appropriate -- again, PCRE shouldn't report a match when in reality there isn't a match. If the masking is considered a useful option for the long term and not only a temporary measure, then there could be two options in addition to the default (strict UTF-32 checking). They might be named:

PCRE_MASK_UTF32_BEYOND_1FFFFF for behavior (2)

and

PCRE_ALLOW_UTF32_BEYOND_10FFFF for behavior (3).

This might only require a few additional lines of code. I'm happy to help with the implementation.

Best wishes,

Tom

文林 Wenlin Institute, Inc.        Software for Learning Chinese
E-mail: wenlin@???     Web: http://www.wenlin.com
Telephone: 1-877-4-WENLIN (1-877-493-6546)
☯