Re: [pcre-dev] [Bug 1049] Add support for UTF-16

Author: Philip Hazel
Date:  
To: Tom Bishop, Wenlin Institute
CC: pcre-dev
Subject: Re: [pcre-dev] [Bug 1049] Add support for UTF-16
On Mon, 14 Nov 2011, Tom Bishop, Wenlin Institute wrote:

> Maybe you've thought of all this already, but I mention it in case
> it's helpful.


Thank you for taking the time to contribute. It's appreciated.

> The differences between UTF-8, UTF-16, and UTF-32 are essentially
> trivial and unrelated to the significant complexities of regular
> expression implementation. (Disclaimer: I know more about UTF
> encodings than I do about regex.) Ideally, the modules that need to be
> different for UTF-8, UTF-16, and UTF-32 should be encapsulated to form
> relatively very small parts of the PCRE source code.


The problem with encapsulating too much is performance. For example, we
*could* change PCRE so that every time it needs to "load next character"
it calls a function. The function could check for 8-bit, 16-bit, or
32-bit data and for UTF status, and then load the data as required.
However, I reckon this would kill PCRE's performance, which does matter
in many applications. Currently, in many cases in the 8-bit environment,
a simple byte load is all that is necessary. Even in UTF-8 mode when
checking for metacharacters, for example, you can write code such as
if (*p == '*') and it works for both ASCII and UTF-8 because of the
clever design of UTF-8.
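
To make that concrete, here is a rough standalone sketch (illustrative
only, not actual PCRE source). Every byte of a multi-byte UTF-8
sequence has its top bit set, so comparing a byte against an ASCII
metacharacter can never match in the middle of a character:

  #include <stddef.h>

  /* Count '*' metacharacters in a pattern. All bytes of a multi-byte
     UTF-8 sequence are >= 0x80, so the comparison p[i] == '*' (0x2A)
     can never match inside one; the same loop is therefore correct
     for both ASCII and UTF-8 input. */
  static size_t count_stars(const unsigned char *p, size_t len)
  {
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
      if (p[i] == '*') n++;
    return n;
  }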

Many functions in PCRE are large, and code is often repeated with only
minor changes; the reason is to avoid too many function calls, for
better performance.

> * the size of code units (whether 8-bit, 16-bit, 32-bit, or even
>   64-bit or larger) -- use a configurable typedef like PcreCodeUnit
>   (or USDATA if you prefer, but PcreCodeUnit is more descriptive);


... but it's data units as well as code units... :-)
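
That said, a width-configurable type along the lines you suggest is
easy enough to sketch. PCRE_UNIT_WIDTH here is a hypothetical
configuration macro, and PcreCodeUnit is your proposed name, not an
actual PCRE type:

  #include <stdint.h>

  /* Hypothetical configuration macro: 8, 16 or 32. */
  #ifndef PCRE_UNIT_WIDTH
  #define PCRE_UNIT_WIDTH 8
  #endif

  #if PCRE_UNIT_WIDTH == 8
  typedef uint8_t  PcreCodeUnit;
  #elif PCRE_UNIT_WIDTH == 16
  typedef uint16_t PcreCodeUnit;
  #elif PCRE_UNIT_WIDTH == 32
  typedef uint32_t PcreCodeUnit;
  #else
  #error "Unsupported PCRE_UNIT_WIDTH"
  #endif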

> * the maximum value for a UCS Scalar Value (currently U+10FFFF but it
>   has changed more than once before, and Perl 5.8 supports much larger
>   code points) -- use a configurable macro like PCRE_MAX_USV;


Internally, PCRE supports code points up to 0x7FFFFFFF in UTF-8, but you
have to suppress its UTF-8 validation check to use them.

> * the maximum length of a code, measured in code units (e.g., 4 for
>   UTF-8,


PCRE supports up to 6 for UTF-8. :-)
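
(The original UTF-8 definition in RFC 2279 allowed sequences of up to
6 bytes, covering values up to 0x7FFFFFFF; RFC 3629 later restricted
UTF-8 to 4 bytes and U+10FFFF.) Purely as an illustration, an encoder
for the old 6-byte scheme looks like this:

  #include <stdint.h>

  /* Encode a code point with the original (RFC 2279) UTF-8 scheme,
     which allows up to 6 bytes for values up to 0x7FFFFFFF. Returns
     the number of bytes written; buf must hold at least 6. */
  static int utf8_encode(uint32_t c, unsigned char *buf)
  {
    static const unsigned char lead[7] =
      { 0, 0, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    int len;

    if (c < 0x80) { buf[0] = (unsigned char)c; return 1; }
    else if (c < 0x800)      len = 2;
    else if (c < 0x10000)    len = 3;
    else if (c < 0x200000)   len = 4;
    else if (c < 0x4000000)  len = 5;
    else                     len = 6;

    /* Continuation bytes (10xxxxxx) carry 6 bits each, low bits last. */
    for (int i = len - 1; i > 0; i--) { buf[i] = 0x80 | (c & 0x3F); c >>= 6; }
    buf[0] = lead[len] | (unsigned char)c;  /* lead byte encodes the length */
    return len;
  }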

> If it's done right, the solution for UTF-32 will come practically for
> free along with the solution for UTF-16, and PCRE will easily be
> extended to support applications we can't even imagine yet, for many
> decades to come. To do it the unencapsulated way, only extending to
> UTF-16 and hard-coding numbers like 16, 0x10FFFF, and 2, and the
> notion of "surrogate pair", all through the source code, would be very
> 20th-century, a kind of built-in obsolescence.


These are good points, but I think PCRE is already quite well covered.
Even the current code uses macros for "load next character" in UTF-8
mode, and I expect that (for example) the "surrogate pair" stuff will go
into a similar macro for UTF-16. The 0x10FFFF limit is only in the
"validate UTF-8" function.

It would be very nice to have a single set of functions and options to
specify data unit size at run time, but I don't think that is acceptable
from a performance point of view. That is why the current suggestion is
to have different function sets for different sizes. I agree that we
should do our best so that the work to provide libpcre16 can trivially
be extended to provide libpcre32 if that is ever wanted.
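
One common way to get that almost for free is to compile the same
source once per data unit size, mangling the external names with a
suffix. The macros below are purely illustrative, not an actual PCRE
scheme:

  #include <stdint.h>

  /* Hypothetical width selector: the build compiles this file three
     times, once per value, producing three separate function sets. */
  #ifndef UNIT_WIDTH
  #define UNIT_WIDTH 8
  #endif

  #if UNIT_WIDTH == 8
  typedef uint8_t  unit_t;
  #define SUFFIX(name) name##_8
  #elif UNIT_WIDTH == 16
  typedef uint16_t unit_t;
  #define SUFFIX(name) name##_16
  #elif UNIT_WIDTH == 32
  typedef uint32_t unit_t;
  #define SUFFIX(name) name##_32
  #endif

  /* Expands to match_8, match_16 or match_32; each library gets its
     own copy, with no run-time width dispatch on the hot path. */
  int SUFFIX(match)(const unit_t *subject, int length);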

[Aside: I've been a programmer for over 40 years. Today's machines are
vastly faster, with huge memories, and yet we still worry about
performance. I suspect we always will. :-]

Philip

--
Philip Hazel