Re: [pcre-dev] [Bug 1295] New: add 32-bit library

Author: Tom Bishop, Wenlin Institute
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] [Bug 1295] New: add 32-bit library

On Sep 14, 2012, at 3:26 PM, Christian Persch (GNOME) <chpe@???> wrote:

> ...Since UTF-32 only occupies 21 bits of the 32-bit characters, it's useful for
> implementations to use the upper bits to store extra info (flags, etc). Since
> it's more efficient to pass the unmodified strings to pcre32, I aim to make
> pcre32 mask out those upper bits. This is done in the code but hasn't been
> debugged yet (it's not working yet).


I suggest that such masking should not be the default; it should be enabled, if at all, only by explicitly setting a configuration option.

If a 32-bit string contains a code unit such as 0x10000021, the safer assumption is that it is *not* equivalent to U+0021.
0x10000021 might trigger a warning that the string is not valid UTF-32, or it might just be treated as a different character. But to treat it by default as matching U+0021 would be just as wrong as an ASCII-based program treating 0xA1 as equivalent to 0x21.
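
As a rough illustration of the two default behaviors described above (reject the unit as invalid UTF-32, or treat it as a distinct character), here is a minimal C sketch. The helper names `is_utf32_scalar` and `units_equal_strict` are hypothetical and are not part of PCRE.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not a PCRE function: true if `u` is a Unicode
   scalar value, i.e. in 0..0x10FFFF and not a surrogate (0xD800..0xDFFF). */
static bool is_utf32_scalar(uint32_t u)
{
    return u <= 0x10FFFFu && !(u >= 0xD800u && u <= 0xDFFFu);
}

/* Strict comparison: no bits are masked, so a unit with extra high bits
   set (such as 0x10000021) never compares equal to U+0021. */
static bool units_equal_strict(uint32_t a, uint32_t b)
{
    return a == b;
}
```

With these definitions, `is_utf32_scalar(0x10000021u)` is false, so a validating caller could warn, while `units_equal_strict(0x10000021u, 0x21u)` is also false, so a non-validating caller would simply see a different character.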

The originally ASCII-based programs that continue to work well today (for Latin1, UTF-8, etc.) are the ones that treat the byte 0xA1 differently from 0x21, and refrain from masking/bending/folding/mutilating it.

Using the upper bits of 32-bit code units for flags, etc., risks incompatibility with future use of code points beyond U+10FFFF (such as for extended private use); developers need to weigh the risks and benefits of such an approach carefully. Anyway, if they do it, they should at least be responsible for setting an option instructing PCRE to mask the high bits. In general, most libraries shouldn't be expected to mask or ignore those bits.
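
To make the opt-in idea concrete, here is a small sketch of a search routine that masks the high bits only when the caller asks for it. Everything named here is an assumption for illustration: `OPT_MASK_HIGH_BITS`, `UTF32_LOW21_MASK`, `normalize_unit`, and `find_unit` are hypothetical and do not correspond to any existing PCRE option or function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical option bit; not an actual PCRE flag. */
#define OPT_MASK_HIGH_BITS 0x1u

/* The low 21 bits are enough to hold every code point up to U+10FFFF. */
#define UTF32_LOW21_MASK 0x001FFFFFu

/* Return the unit unchanged by default; strip the upper 11 bits only
   when the caller has explicitly requested masking. */
static uint32_t normalize_unit(uint32_t u, unsigned options)
{
    return (options & OPT_MASK_HIGH_BITS) ? (u & UTF32_LOW21_MASK) : u;
}

/* Return the index of the first unit in s[0..len) that matches `needle`
   after optional masking, or -1 if there is none. */
static ptrdiff_t find_unit(const uint32_t *s, size_t len,
                           uint32_t needle, unsigned options)
{
    for (size_t i = 0; i < len; i++) {
        if (normalize_unit(s[i], options) == needle)
            return (ptrdiff_t)i;
    }
    return -1;
}
```

Given the string `{0x41, 0x10000021, 0x21}`, searching for `0x21` with `options == 0` finds index 2, because the flagged unit is treated as a different character. With `OPT_MASK_HIGH_BITS` set, the same search finds index 1. The application that chose to store flags in the upper bits is the one that has to ask for the second behavior.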

I hope this suggestion is helpful. A 32-bit PCRE is likely to be useful for the long-term future, especially if code points beyond U+10FFFF are eventually employed.

Best wishes,

Tom

文林 Wenlin Institute, Inc.        Software for Learning Chinese
E-mail: wenlin@???     Web: http://www.wenlin.com
Telephone: 1-877-4-WENLIN (1-877-493-6546)
☯