Re: [pcre-dev] [Bug 1049] Add support for UTF-16

Author: Tom Bishop, Wenlin Institute
Date:
To: pcre-dev
Subject: Re: [pcre-dev] [Bug 1049] Add support for UTF-16

Maybe you've thought of all this already, but I mention it in case it's helpful.

The differences between UTF-8, UTF-16, and UTF-32 are essentially trivial and unrelated to the significant complexities of regular expression implementation. (Disclaimer: I know more about UTF encodings than I do about regex.) Ideally, the modules that need to differ between UTF-8, UTF-16, and UTF-32 should be encapsulated so that they form a relatively small part of the PCRE source code.

Some things to encapsulate:

* the size of code units (whether 8-bit, 16-bit, 32-bit, or even 64-bit or larger) -- use a configurable typedef like PcreCodeUnit (or USDATA if you prefer, but PcreCodeUnit is more descriptive);

* the maximum value for a UCS Scalar Value (currently U+10FFFF but it has changed more than once before, and Perl 5.8 supports much larger code points) -- use a configurable macro like PCRE_MAX_USV;

* the maximum length of a code, measured in code units (e.g., 4 for UTF-8, 2 for UTF-16, 1 for UTF-32, supposing PCRE_MAX_USV = 0x10FFFF) -- use a configurable macro like PCRE_MAX_UNITS_PER_CODE;

* the algorithms for determining the length of a code and converting between codes and USVs -- encapsulate these functions or macros so that they can easily be replaced.
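
To make the list above concrete, here is a minimal sketch of how such an encapsulation might look in C. It is only an illustration, not existing PCRE code: the build switch PCRE_CODE_UNIT_BITS and the function names pcre_usv_length, pcre_usv_encode, and pcre_usv_decode are hypothetical, and only the 16-bit and 32-bit cases are filled in (an 8-bit variant would take the same shape).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build-time switch (not a real PCRE option): 16 or 32. */
#ifndef PCRE_CODE_UNIT_BITS
#define PCRE_CODE_UNIT_BITS 16
#endif

#define PCRE_MAX_USV 0x10FFFFu          /* largest scalar value accepted */

#if PCRE_CODE_UNIT_BITS == 16
typedef uint16_t PcreCodeUnit;
#define PCRE_MAX_UNITS_PER_CODE 2
#elif PCRE_CODE_UNIT_BITS == 32
typedef uint32_t PcreCodeUnit;
#define PCRE_MAX_UNITS_PER_CODE 1
#else
#error "this sketch only covers 16-bit and 32-bit code units"
#endif

/* A scalar value is any code point up to PCRE_MAX_USV except a surrogate. */
static int pcre_is_scalar(uint32_t usv)
{
    return usv <= PCRE_MAX_USV && !(usv >= 0xD800u && usv <= 0xDFFFu);
}

/* Number of code units needed to encode usv, or 0 if usv is not valid. */
static size_t pcre_usv_length(uint32_t usv)
{
    if (!pcre_is_scalar(usv)) return 0;
#if PCRE_CODE_UNIT_BITS == 16
    return usv < 0x10000u ? 1 : 2;
#else
    return 1;
#endif
}

/* Encode usv into out (room for PCRE_MAX_UNITS_PER_CODE units).
   Returns the number of units written, or 0 if usv is not valid. */
static size_t pcre_usv_encode(uint32_t usv, PcreCodeUnit *out)
{
    size_t n = pcre_usv_length(usv);
    assert(out != NULL);
#if PCRE_CODE_UNIT_BITS == 16
    if (n == 2) {
        uint32_t v = usv - 0x10000u;    /* 20 bits, split 10 + 10 */
        out[0] = (PcreCodeUnit)(0xD800u + (v >> 10));
        out[1] = (PcreCodeUnit)(0xDC00u + (v & 0x3FFu));
        return 2;
    }
#endif
    if (n == 1) out[0] = (PcreCodeUnit)usv;
    return n;
}

/* Decode one code from in (avail units readable) into *usv.
   Returns the number of units consumed, or 0 on malformed input. */
static size_t pcre_usv_decode(const PcreCodeUnit *in, size_t avail, uint32_t *usv)
{
    assert(in != NULL && usv != NULL);
    if (avail == 0) return 0;
#if PCRE_CODE_UNIT_BITS == 16
    if (in[0] >= 0xD800u && in[0] <= 0xDBFFu) {     /* high half of a pair */
        if (avail < 2 || in[1] < 0xDC00u || in[1] > 0xDFFFu) return 0;
        *usv = 0x10000u + (((uint32_t)in[0] - 0xD800u) << 10)
                        + ((uint32_t)in[1] - 0xDC00u);
        return 2;
    }
#endif
    if (!pcre_is_scalar(in[0])) return 0;           /* lone surrogate, or too large */
    *usv = in[0];
    return 1;
}
```

The rest of the library would then refer only to PcreCodeUnit, PCRE_MAX_UNITS_PER_CODE, and these three functions, so the surrogate arithmetic stays confined to one small module and moving to another encoding means swapping that module rather than touching the matcher.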

If it's done right, the solution for UTF-32 will come practically for free along with the solution for UTF-16, and PCRE will easily be extended to support applications we can't even imagine yet, for many decades to come. To do it the unencapsulated way, only extending to UTF-16 and hard-coding numbers like 16, 0x10FFFF, and 2, and the notion of "surrogate pair", all through the source code, would be very 20th-century, a kind of built-in obsolescence.

Best wishes,

Tom

文林 Wenlin Institute, Inc.        Software for Learning Chinese
E-mail: wenlin@???     Web: http://www.wenlin.com
Telephone: 1-877-4-WENLIN (1-877-493-6546)
☯
