Maybe you've thought of all this already, but I mention it in case it's helpful.
The differences between UTF-8, UTF-16, and UTF-32 are essentially trivial and unrelated to the significant complexities of regular expression implementation. (Disclaimer: I know more about UTF encodings than I do about regex.) Ideally, the modules that need to differ between UTF-8, UTF-16, and UTF-32 should be encapsulated so that they form a relatively small part of the PCRE source code.
Some things to encapsulate:
* the size of code units (whether 8-bit, 16-bit, 32-bit, or even 64-bit or larger) -- use a configurable typedef like PcreCodeUnit (or USDATA if you prefer, but PcreCodeUnit is more descriptive);
* the maximum value for a UCS Scalar Value (currently U+10FFFF but it has changed more than once before, and Perl 5.8 supports much larger code points) -- use a configurable macro like PCRE_MAX_USV;
* the maximum length of a code, measured in code units (e.g., 4 for UTF-8, 2 for UTF-16, 1 for UTF-32, supposing PCRE_MAX_USV = 0x10FFFF) -- use a configurable macro like PCRE_MAX_UNITS_PER_CODE;
* the algorithms for determining the lengths of codes and converting between codes and USVs -- encapsulate these functions or macros so that they can easily be replaced.
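To make the list above concrete, here is a rough sketch in C of what such an encapsulation layer might look like. It is only an illustration under my own assumptions, not actual PCRE code: the build switch PCRE_CODE_UNIT_BITS and the function names pcre_usv_length, pcre_encode, and pcre_decode are hypothetical, and the decoder assumes well-formed input and does no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build-time switch (8, 16, or 32); not a real PCRE option. */
#ifndef PCRE_CODE_UNIT_BITS
#define PCRE_CODE_UNIT_BITS 16
#endif

/* Largest scalar value supported; configurable in one place. */
#define PCRE_MAX_USV 0x10FFFFu

#if PCRE_CODE_UNIT_BITS == 8
typedef uint8_t PcreCodeUnit;
#define PCRE_MAX_UNITS_PER_CODE 4
#elif PCRE_CODE_UNIT_BITS == 16
typedef uint16_t PcreCodeUnit;
#define PCRE_MAX_UNITS_PER_CODE 2
#else
typedef uint32_t PcreCodeUnit;
#define PCRE_MAX_UNITS_PER_CODE 1
#endif

/* Number of code units needed to encode usv. */
static size_t pcre_usv_length(uint32_t usv)
{
#if PCRE_CODE_UNIT_BITS == 8
    return usv < 0x80 ? 1 : usv < 0x800 ? 2 : usv < 0x10000 ? 3 : 4;
#elif PCRE_CODE_UNIT_BITS == 16
    return usv < 0x10000 ? 1 : 2;
#else
    (void)usv;
    return 1;
#endif
}

/* Encode usv into out, which must have room for PCRE_MAX_UNITS_PER_CODE
   units. Returns the number of units written. */
static size_t pcre_encode(uint32_t usv, PcreCodeUnit *out)
{
    size_t n = pcre_usv_length(usv);
    assert(usv <= PCRE_MAX_USV);
#if PCRE_CODE_UNIT_BITS == 8
    if (n == 1) {
        out[0] = (PcreCodeUnit)usv;
    } else {
        static const uint8_t lead[] = { 0, 0, 0xC0, 0xE0, 0xF0 };
        size_t i;
        for (i = n - 1; i > 0; i--) {   /* trailing bytes, last one first */
            out[i] = (PcreCodeUnit)(0x80 | (usv & 0x3F));
            usv >>= 6;
        }
        out[0] = (PcreCodeUnit)(lead[n] | usv);
    }
#elif PCRE_CODE_UNIT_BITS == 16
    if (n == 1) {
        out[0] = (PcreCodeUnit)usv;
    } else {                            /* the only place surrogates appear */
        usv -= 0x10000;
        out[0] = (PcreCodeUnit)(0xD800 | (usv >> 10));
        out[1] = (PcreCodeUnit)(0xDC00 | (usv & 0x3FF));
    }
#else
    out[0] = (PcreCodeUnit)usv;
#endif
    return n;
}

/* Decode one well-formed code starting at in. Stores the scalar value in
   *usv and returns the number of units consumed. No validation is done. */
static size_t pcre_decode(const PcreCodeUnit *in, uint32_t *usv)
{
#if PCRE_CODE_UNIT_BITS == 8
    uint32_t c = in[0];
    size_t n = c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
    size_t i;
    if (n > 1)
        c &= 0xFFu >> (n + 1);          /* strip the length-marker bits */
    for (i = 1; i < n; i++)
        c = (c << 6) | (in[i] & 0x3Fu);
    *usv = c;
    return n;
#elif PCRE_CODE_UNIT_BITS == 16
    uint32_t c = in[0];
    if (c >= 0xD800 && c <= 0xDBFF) {   /* high surrogate: read the pair */
        *usv = 0x10000 + ((c - 0xD800) << 10) + ((uint32_t)in[1] - 0xDC00);
        return 2;
    }
    *usv = c;
    return 1;
#else
    *usv = in[0];
    return 1;
#endif
}
```

The rest of the matcher would then call only pcre_decode and pcre_encode and size its buffers with PCRE_MAX_UNITS_PER_CODE, so that switching the encoding form means rebuilding with a different PCRE_CODE_UNIT_BITS rather than editing code throughout the source.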
If it's done right, the solution for UTF-32 will come practically for free along with the solution for UTF-16, and PCRE can easily be extended to support applications we can't even imagine yet, for many decades to come. Doing it the unencapsulated way, extending only to UTF-16 and hard-coding numbers like 16, 0x10FFFF, and 2, along with the notion of a "surrogate pair", all through the source code, would be very 20th-century, a kind of built-in obsolescence.
Best wishes,
Tom
文林 Wenlin Institute, Inc. Software for Learning Chinese
E-mail: wenlin@??? Web: http://www.wenlin.com
Telephone: 1-877-4-WENLIN (1-877-493-6546)
☯