Re: [pcre-dev] [Bug 1295] add 32-bit library

Author: Tom Bishop, Wenlin Institute
Date:  
To: pcre-dev@exim.org Mailing List
Subject: Re: [pcre-dev] [Bug 1295] add 32-bit library
>>>> The idea is this: the programme that's using the pcre32 API wants to
>>>> use it on some data it has. That data isn't only used for matching
>>>> however, ie it may also be displayed, etc, and the programme has
>>>> therefore stored some flags into the unused-by-UTF-32 high bits of the
>>>
>>> Wow, wow... stop it right there. Back in the seventies, when we used such techniques, they were
>>> already considered IMPOLITE (or shall we say, downright wrong). And in those days, both core
>>> (actually real CORE) memory and disk (usually tape) space were expensive so there was some
>>> twisted justification for that behavior.
>>
>> This is a good point. If you have enough free space for utf-32 (where
>> you waste 2-3 bytes for nearly all characters), you probably don't
>> need those extra bits. I worry about the performance loss. On most
>> machines you need two-three instructions to do that masking. Does
>> anyone plan to use this?
>
> I too was around in the seventies, and indeed we used to do this kind of
> thing because space was very tight. Nowadays one can usually avoid it. I
> have to say that if I had done the 32-bit implementation, I would not
> have considered including masking.



"Wow, wow... stop it right there" is good advice. It may seem "impolite" advice, especially given that the masking feature was introduced, with good intentions, by the same person who did the hard work of implementing 32-bit support, for which we should all be grateful. But what's really "impolite (or shall we say, downright wrong)" (though not intentionally, of course) is to add a feature that will do more harm than good for everybody, including those who want to use it.

Lack of separation between public and private data formats is at the root of the problem. There are good reasons for published APIs to use standard data formats and protocols, and for modules to keep their implementation details private.

Anyone considering the masking feature should ask themselves: how many of the eleven bits will the application use? If the application has a long-term future, is there any possibility that it will eventually want more than eleven bits, and if so, what will happen then? In the long run, would it be easier and more effective to organize the data in a way that doesn't have a built-in eleven-bit limitation?

Will the application do other things with text besides regex searching, such as layout, display, collation, normalization, conversion, and translation? If so, various libraries and operating system features are available for these purposes. Will the libraries and operating systems also need to be revised to support masking? How likely is that, and how much will it cost?

Is there any plan to give the new data format a name, such as "UTF-21", and a specification that applications, libraries, and operating systems using UTF-21 can conform to? Let's be clear: UTF-32 is a 32-bit encoding form in which the high eleven bits are always zero. Anything else is not UTF-32. Unicode defines multiple encoding forms, and you can define more if you'd like, but you can't redefine UTF-32.
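
To make that definition concrete, here is a minimal sketch in C of what a strict UTF-32 check amounts to. The function names are hypothetical and this is not PCRE's own validation code; it only assumes the Unicode rule that a UTF-32 code unit must be a scalar value (at most 0x10FFFF and not a surrogate).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if c is a Unicode scalar value, i.e. a
   legal UTF-32 code unit. Any value with one of the high eleven bits
   set is greater than 0x10FFFF and is therefore rejected. */
static bool utf32_unit_is_valid(uint32_t c)
{
    if (c > 0x10FFFFu)
        return false;               /* beyond the Unicode code space */
    if (c >= 0xD800u && c <= 0xDFFFu)
        return false;               /* surrogates are not scalar values */
    return true;
}

/* Hypothetical helper: index of the first invalid unit, or len if the
   whole buffer is valid UTF-32. */
static size_t utf32_first_invalid(const uint32_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (!utf32_unit_is_valid(s[i]))
            return i;
    }
    return len;
}
```

With this check, 0x00000041 passes, while 0x80000041 (the same "A" with a private flag stored in the top bit) is reported as invalid, which is what a conforming UTF-32 consumer should do.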

It was suggested that supporting this format is analogous to supporting gzip, but there are important differences. gzip is a well-defined and widely implemented standard; UTF-21 isn't. gzip-compressed text is very different from plain text and unlikely to be confused with it. In contrast, UTF-21 appears to be an extension of UTF-32, meaning that any UTF-32 string is also a UTF-21 string. Consequently, the easiest way for a library to implement both UTF-32 and UTF-21 is simply to implement UTF-21; that is, always mask the high eleven bits. But if libraries take this lazy approach, UTF-32 validity checking goes out the window, with bad consequences for error detection and security. Effectively, each code point would have 2048 (2^11) equivalent codes rather than only one. To prevent that, a standard protocol would need to be defined, published, and implemented to determine when to mask and when not to mask.
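
The arithmetic behind "2048 equivalent codes" can be sketched in C as follows. The names UTF21_MASK, utf21_mask and utf21_alias_count are made up for illustration, and the mask value assumes that "masking" means clearing the high eleven bits and keeping the low 21.

```c
#include <stdint.h>

/* Assumption: masking keeps the low 21 bits and discards the rest. */
#define UTF21_MASK UINT32_C(0x001FFFFF)

static uint32_t utf21_mask(uint32_t c)
{
    return c & UTF21_MASK;
}

/* Number of distinct 32-bit values that mask to the same 21-bit value:
   one for each setting of the eleven discarded bits, 2^11 = 2048. */
static uint32_t utf21_alias_count(void)
{
    return UINT32_C(1) << (32 - 21);
}
```

For example, 0x00000041, 0x00200041 and 0xFFE00041 all mask to 0x41, so a library that always masks cannot tell them apart, and a stray or malicious high bit passes unnoticed. Masking alone does not make a value valid either: 0x001FFFFF survives the mask unchanged but is still above 0x10FFFF.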

Currently, PCRE sets a horrible precedent for such a protocol. PCRE_NO_UTF32_CHECK has two meanings at the same time: "don't check the input, since we already know it's valid UTF-32" and "mask the input, since it's not UTF-32, it's really UTF-21". At the very least, this needs to be fixed before the next version of PCRE is released. However, I've changed my mind on the larger question. I used to think it would be OK to support masking as long as it was off by default and turned on only by an explicit option. Now I think it would be better if everyone could be convinced that it should be removed entirely. Still, I'm willing to help improve the protocol once there's a decision about what should be done. There should be a decision.
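
If masking were kept, one way to untangle the two meanings would be to give each its own flag. The sketch below is purely illustrative: EX_NO_CHECK, EX_MASK_HIGH_BITS and ex_prepare_unit are hypothetical names, not part of the PCRE API, and the semantics shown are an assumption about what a cleaner protocol could look like, not a description of what PCRE does.

```c
#include <stdint.h>

/* Hypothetical option bits, independent of each other. */
#define EX_NO_CHECK        0x1u  /* caller vouches that input is valid UTF-32 */
#define EX_MASK_HIGH_BITS  0x2u  /* input is not UTF-32; clear high 11 bits  */

/* Hypothetical per-unit preparation step. Returns 0 and stores the
   code point in *out on success, or -1 if validation fails. */
static int ex_prepare_unit(uint32_t c, unsigned options, uint32_t *out)
{
    if (options & EX_MASK_HIGH_BITS)
        c &= UINT32_C(0x001FFFFF);      /* explicit opt-in only */

    if (!(options & EX_NO_CHECK)) {
        if (c > 0x10FFFFu || (c >= 0xD800u && c <= 0xDFFFu))
            return -1;                  /* not a Unicode scalar value */
    }

    *out = c;
    return 0;
}
```

With separate flags, a caller with known-good UTF-32 can skip the check without silently opting in to masking, and a caller who asks for masking can still have the masked result validated.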

Best wishes,

Tom

文林 Wenlin Institute, Inc.        Software for Learning Chinese
E-mail: wenlin@???     Web: http://www.wenlin.com
Telephone: 1-877-4-WENLIN (1-877-493-6546)
☯