Re: [pcre-dev] [Bug 1295] New: add 32-bit library

Author: Zoltán Herczeg
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] [Bug 1295] New: add 32-bit library
We have only a few (two) masks, as far as I remember. In practice, the 16-bit and 32-bit modes are basically useless without UTF, since you can set character types only for the first 256 code points. And UTF is not likely to go beyond 0x10FFFF in the foreseeable future, since only a small fragment of that range is filled with characters, and that fragment already covers nearly all spoken and dead (but real!) languages. UTF is not intended to be a picture library for bored graphics designers, so we can say PCRE only supports characters <= 0xfffffff without limiting any practical use cases.

I agree with the checks.
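
To make the checks and the opt-in masking from the quoted discussion below concrete, here is a minimal sketch in C. It is only an illustration under stated assumptions: `decode_unit`, `OPT_MASK_HIGH_BITS`, and the two constants are hypothetical names, not part of the PCRE API, and this is not how pcre32 actually implements it.

```c
#include <stdint.h>

/* Hypothetical option flag (an assumption, not a real PCRE option). */
#define OPT_MASK_HIGH_BITS 0x1u

/* Code points up to U+10FFFF fit in the low 21 bits of a 32-bit unit. */
#define UTF32_MASK 0x001FFFFFu
#define UTF32_MAX  0x0010FFFFu

/* Returns 1 and stores the code point in *out if the unit is acceptable,
   or 0 if the unit should be reported as invalid UTF-32. */
static int decode_unit(uint32_t unit, unsigned options, uint32_t *out)
{
    if (options & OPT_MASK_HIGH_BITS)
        unit &= UTF32_MASK;      /* caller explicitly asked to drop flag bits */

    if (unit > UTF32_MAX)
        return 0;                /* beyond U+10FFFF */
    if (unit >= 0xD800u && unit <= 0xDFFFu)
        return 0;                /* surrogates are not valid in UTF-32 */

    *out = unit;
    return 1;
}
```

With no options set, this sketch rejects 0x10000021 rather than silently treating it as U+0021; only a caller that sets the hypothetical flag gets the masked value.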

Regards,
Zoltan

"Tom Bishop, Wenlin Institute" <tangmu@???> wrote:

> On Sep 14, 2012, at 3:26 PM, Christian Persch (GNOME) <chpe@???> wrote:
> >
> > ...Since UTF-32 only occupies 21 bits of the 32-bit characters, it's useful for
> > implementations to use the upper bits to store extra info (flags, etc). Since
> > it's more efficient to pass the unmodified strings to pcre32, I aim to make
> > pcre32 mask out those upper bits. This is done in the code but hasn't been
> > debugged yet (it's not working yet).
>
> I suggest that such masking behavior should not be the default, but only enabled, if at all, by explicitly setting some configuration option.
>
> If a 32-bit string contains a code unit such as 0x10000021, the safer assumption is that it is *not* equivalent to U+0021.
> 0x10000021 might trigger a warning that the string is not valid UTF-32, or it might just be treated as a different character. But to treat it by default as matching U+0021 would be just as wrong as an ASCII-based program treating 0xA1 as equivalent to 0x21.
>
> The originally ASCII-based programs that continue to work well today (for Latin1, UTF-8, etc.) are the ones that treat the byte 0xA1 differently from 0x21, and refrain from masking/bending/folding/mutilating it.
>
> Using the upper bits of 32-bit code units for flags, etc., risks incompatibility with future use of code points beyond U+10FFFF (such as for extended private use); developers need to weigh the risks and benefits of such an approach carefully. Anyway, if they do it, they should at least be responsible for setting an option instructing PCRE to mask the high bits. In general, most libraries shouldn't be expected to mask or ignore those bits.
>
> I hope this suggestion is helpful. A 32-bit PCRE is likely to be useful for the long-term future, especially if code points beyond U+10FFFF are eventually employed.
>
> Best wishes,
>
> Tom