[pcre-dev] [Bug 1049] Add support for UTF-16

Author: Philip Hazel
Date:
To: pcre-dev
Old-Topics: [pcre-dev] [Bug 1049] New: Add support for UTF-16

------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1049

--- Comment #12 from Philip Hazel <ph10@???> 2011-11-14 10:25:27 ---
I have written a short document that discusses options for supporting 16-bit
data and modifying memory management (bug 1174). I think it's short enough to
post here:

PCRE: SUPPORT FOR MEMORY MANAGEMENT AND 16-BIT DATA

Philip Hazel
14 November 2011

Recently there have been discussions of two issues that cannot easily be
supported within the current API:

(1)  More general support for memory management: the current static variables
     mean that all threads must use the same functions, and there is no way for
     a library that uses PCRE to isolate itself from the functions used by the
     calling application. This is Bugzilla #1174.

(1a) A corollary to (1) is the support for a callout function, which also
     currently uses a static variable.

(2)  Support for UTF-16 character strings, and possibly also 16-bit, but not
     UTF-16, character strings. The latter are often called wchar strings, but
     strictly, wchar does not have to be 16 bits wide. This is Bugzilla #1049.

How can PCRE be extended to deal with these issues?

CALLOUT FUNCTIONS

The callout issue is separate from the others, and can be handled compatibly by
adding a new field to the pcre_extra structure that points to a callout
function. There is already a field that contains an argument for callouts.
There is no difficulty here, so this is not discussed further.
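
As an illustration only, the idea of a per-call callout pointer that takes
precedence over a process-wide one might be sketched as follows. The names
demo_extra, demo_callout_fn and demo_run_callout are hypothetical stand-ins
invented for this sketch; they are not the PCRE API, and the real pcre_extra
structure has other fields.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callout signature: a position plus the caller's own data. */
typedef int (*demo_callout_fn)(int position, void *callout_data);

/* Hypothetical stand-in for pcre_extra, reduced to the two relevant fields. */
typedef struct demo_extra {
    void *callout_data;       /* argument handed to the callout */
    demo_callout_fn callout;  /* proposed per-call function, or NULL */
} demo_extra;

/* Process-wide fallback, analogous to the current static variable. */
static demo_callout_fn demo_global_callout = NULL;

/* Use the per-call function when one is set, otherwise the global one.
   Returns 0 when no callout function is available at all. */
static int demo_run_callout(const demo_extra *extra, int position)
{
    demo_callout_fn fn = demo_global_callout;
    void *data = NULL;

    if (extra != NULL) {
        data = extra->callout_data;
        if (extra->callout != NULL) fn = extra->callout;
    }
    return (fn != NULL) ? fn(position, data) : 0;
}

/* Example callout: stores the position in the int the caller supplied. */
static int demo_record_position(int position, void *callout_data)
{
    assert(callout_data != NULL);
    *(int *)callout_data = position;
    return 0;
}
```

Because the pointer travels with each call, two threads (or a library and its
calling application) could use different callout functions without touching
shared state.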

MEMORY MANAGEMENT

Memory management affects the following areas in the library:

  pcre_compile
  pcre_exec
  pcre_get_xxx
  pcre_maketables and dftables.c
  pcre_study
  regexec (in the POSIX wrapper)
  the JIT compiler

The only one of these that could be compatibly extended is pcre_exec, where the
pcre_extra structure could be used. However, there seems no point in this when
the other areas cannot be handled compatibly.

A compatible memory solution that partially solves the library-calling-PCRE
issue is to add an option PCRE_USE_MALLOC, which would force PCRE to ignore
the static variables, and always use malloc/free. This works for the
implementor of the library who raised this issue, but is not of course fully
general.
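
A minimal sketch of how such an option could behave, assuming an option bit
that is checked before the process-wide hooks are consulted. DEMO_USE_MALLOC,
demo_malloc_hook, demo_free_hook and the helper functions are hypothetical
names for this sketch; the bit value is arbitrary and nothing here is taken
from the PCRE source.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical option bit; the value is arbitrary for this sketch. */
#define DEMO_USE_MALLOC 0x0001u

/* Process-wide hooks, analogous to the current static variables. */
static void *(*demo_malloc_hook)(size_t) = malloc;
static void (*demo_free_hook)(void *) = free;

/* Counting replacement hook, used only to show which path was taken. */
static int demo_hook_calls = 0;
static void *demo_counting_malloc(size_t size)
{
    demo_hook_calls++;
    return malloc(size);
}

/* Allocate via plain malloc when the option is set, else via the hook. */
static void *demo_alloc(unsigned options, size_t size)
{
    if (options & DEMO_USE_MALLOC) return malloc(size);
    return demo_malloc_hook(size);
}

/* Release with the function that matches the allocation path. */
static void demo_release(unsigned options, void *block)
{
    if (options & DEMO_USE_MALLOC) free(block);
    else demo_free_hook(block);
}
```

A library that always passes the option is unaffected by whatever the calling
application has installed in the hooks, which is the partial isolation
described above. The same option must be used for allocation and release.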

16-BIT SUPPORT

Support for UTF-16 in a totally compatible manner would presumably involve a
PCRE_UTF16 option, similar to PCRE_UTF8. However, the code would have to be
changed a lot more than was needed for UTF-8 support, because of the different
data size. I do not think this is a sensible solution.

The minimal way to handle UTF-16 is to provide an interface that does internal
translation, keeping an index of byte-offsets to char-offsets so that the
caller does not need to handle them. The disadvantage is the resources needed
to do the translation for each and every call to functions that handle strings.
This does not appear to be an acceptable long-term solution.
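
To make the cost concrete, here is a rough sketch of the translation step: it
converts a host-order UTF-16 string to UTF-8 and records, for every output
byte, the index of the 16-bit unit where its character started, so that byte
offsets returned by a match can be mapped back for the caller. The function
name demo_utf16_to_utf8 and its parameters are assumptions made for this
sketch, not an existing interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert src (UTF-16, host byte order) to UTF-8 in dst. For each output
   byte i, map[i] receives the index of the 16-bit unit that starts the
   character containing that byte. dst and map must each have room for
   dst_cap entries. Returns the number of bytes written, or (size_t)-1 on
   malformed input or insufficient space. */
static size_t demo_utf16_to_utf8(const uint16_t *src, size_t src_len,
                                 unsigned char *dst, size_t *map,
                                 size_t dst_cap)
{
    size_t in = 0, out = 0;

    while (in < src_len) {
        size_t start = in;
        uint32_t c = src[in++];

        if (c >= 0xD800 && c <= 0xDBFF) {            /* high surrogate */
            uint32_t low;
            if (in >= src_len) return (size_t)-1;
            low = src[in];
            if (low < 0xDC00 || low > 0xDFFF) return (size_t)-1;
            in++;
            c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            return (size_t)-1;                       /* stray low surrogate */
        }

        size_t n = (c < 0x80) ? 1 : (c < 0x800) ? 2 : (c < 0x10000) ? 3 : 4;
        if (out + n > dst_cap) return (size_t)-1;

        if (n == 1) {
            dst[out] = (unsigned char)c;
        } else if (n == 2) {
            dst[out]     = (unsigned char)(0xC0 | (c >> 6));
            dst[out + 1] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (n == 3) {
            dst[out]     = (unsigned char)(0xE0 | (c >> 12));
            dst[out + 1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            dst[out + 2] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            dst[out]     = (unsigned char)(0xF0 | (c >> 18));
            dst[out + 1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            dst[out + 2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            dst[out + 3] = (unsigned char)(0x80 | (c & 0x3F));
        }

        for (size_t k = 0; k < n; k++) map[out + k] = start;
        out += n;
    }
    return out;
}
```

Both the converted copy and the offset map have to be rebuilt for every
subject string, which is the per-call overhead referred to above.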

If a complete redesign were done, we should consider all aspects of the API,
not just these particular issues. It would be a chance to tidy up things that
were bodged in the past (to retain compatibility). In earlier discussions I
raised the idea of such a redesign, and having several sets of functions such
as pcre_compile_ascii, pcre_compile_utf8, pcre_compile_utf16, etc., built from
the same source but with different compile-time options and macros.

HOWEVER: I no longer think this is the way we should go, for several different
reasons:

(1) Evolution is usually better than revolution.

(2) An incompatible API change would mean supporting both old and new for a
while, and many programs wouldn't change for many years (I speak from
experience). It is a maintenance and documentation load we can do without.

(3) I remembered about the (*UTF8) feature of PCRE patterns that allows the
writer of the pattern to select UTF-8 support. This means that separating ASCII
and UTF-8 support isn't really sensible.

(4) I am retired and getting older and no longer have as much time or energy as
I used to have for working on PCRE. A really big project is probably not
something I should undertake. What I propose below is big enough. :-)

PROPOSAL

I'd like to put forward the following proposal:

(1) Provide a callout function pointer in pcre_extra.

(2) Add PCRE_USE_MALLOC as a quick-and-easy way to help the library writer.

(3) Leave the rest of the current API as it is, but define a new set of 16-bit
functions. I suggest the names pcre16_compile, pcre16_exec, etc. These
functions will operate on 16-bit data strings, in host byte order, and could
support plain 16-bit characters or UTF-16 via an option. They could (should?)
be built into an entirely separate library, e.g. libpcre16.

I am assuming that there are not many programs that want to handle both 8-bit
and 16-bit data strings, but those that do would have to use both libraries.

The 16-bit library will use strings of 16-bit values instead of 8-bit values.
To implement it, we first go through the existing code, replacing all 8-bit
data string definitions such as unsigned char with macros, e.g. USDATA,
defaulting to 8-bit definitions. Then the same code can be re-compiled with
different macro definitions for the 16-bit version. The names of the functions
must also be handled with a macro. There are other details, such as how
LINK_SIZE is handled, but you get the idea.
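
The macro approach could look roughly like the sketch below, where a single
source file yields either an 8-bit or a 16-bit function depending on a
compile-time setting. USDATA is the macro name suggested above; DEMO_WIDTH,
DEMO_FN and the two functions are hypothetical names chosen for this sketch,
and the real code would need far more than this.

```c
#include <stddef.h>
#include <stdint.h>

/* DEMO_WIDTH would normally come from the build, e.g. -DDEMO_WIDTH=16.
   It defaults to the 8-bit definitions. */
#ifndef DEMO_WIDTH
#define DEMO_WIDTH 8
#endif

#if DEMO_WIDTH == 16
#define USDATA uint16_t
#define DEMO_FN(name) demo16_##name   /* function names for the 16-bit build */
#else
#define USDATA unsigned char
#define DEMO_FN(name) demo_##name     /* function names for the 8-bit build */
#endif

/* Length, in data units, of a zero-terminated string. */
size_t DEMO_FN(length)(const USDATA *s)
{
    size_t n = 0;
    while (s[n] != 0) n++;
    return n;
}

/* Offset of the first data unit equal to unit, or the length if absent. */
size_t DEMO_FN(find)(const USDATA *s, USDATA unit)
{
    size_t n = 0;
    while (s[n] != 0 && s[n] != unit) n++;
    return n;
}
```

Compiled as is, this defines demo_length and demo_find on unsigned char
strings; compiled with DEMO_WIDTH set to 16, the same text defines
demo16_length and demo16_find on uint16_t strings.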

Whether pcretest is extended to allow for 16-bit data, or whether we have an
entirely separate pcretest16 is something that needs considering. Likewise,
pcregrep.

Compiled 16-bit patterns will be a sequence of 16-bit values instead of 8-bit
values and so will in general be up to twice as big.

Thought will be needed about the save/restore facilities. It may be that they
will no longer work for saved patterns reloaded into a system with different
endianness. I do not know how much use is made of this facility.
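
One possible way to cope with reloading on a host of the opposite endianness
is to store a known marker in the first 16-bit unit and swap bytes when it is
read back reversed. The sketch below assumes that layout; DEMO_MAGIC and the
function names are hypothetical. It also treats the saved pattern as nothing
but 16-bit units, which is a simplification: a real compiled pattern contains
fields of other widths that would need separate handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker stored in the first unit of a saved pattern. */
#define DEMO_MAGIC         0x5043u
#define DEMO_MAGIC_SWAPPED 0x4350u

/* Exchange the two bytes of a 16-bit value. */
static uint16_t demo_swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Inspect units[0]. Returns 0 if the data is already in host order, 1 if
   it was in the opposite order and has been swapped in place, and -1 if
   the marker is not recognised (the data is left unchanged). */
static int demo_fix_byte_order(uint16_t *units, size_t count)
{
    if (count == 0) return -1;
    if (units[0] == DEMO_MAGIC) return 0;
    if (units[0] != DEMO_MAGIC_SWAPPED) return -1;

    for (size_t i = 0; i < count; i++)
        units[i] = demo_swap16(units[i]);
    return 1;
}
```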

DOWNSIDE

This proposal does not provide fully general memory handling facilities.
However, it does not preclude adding them later.

What do people think?

*** End ***

--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email
