[pcre-dev] [Bug 1049] Add support for UTF-16

Author: Philip Hazel
Date:
To: pcre-dev
Old-Topics: [pcre-dev] [Bug 1049] New: Add support for UTF-16
Subject: [pcre-dev] [Bug 1049] Add support for UTF-16

http://bugs.exim.org/show_bug.cgi?id=1049

--- Comment #5 from Philip Hazel <ph10@???> 2011-11-09 16:11:39 ---
On Wed, 9 Nov 2011, Stephen Kelly wrote:

> --- Comment #4 from Stephen Kelly <steveire@???> 2011-11-09 14:12:47 ---
> This is also an issue for Qt 5. There is a wish to replace the QRegExp with an
> external dependency (PCRE and the v8 RE engine are discussed).
>
> PCRE is the obvious best, but incurs the cost of utf-8 to utf-16 conversion and
> the fact that pcreapi returns byte offsets rather than characters:
>
> http://thread.gmane.org/gmane.comp.lib.qt.qt5-feedback/1524/focus=1616
>
> Would a way to use pcre with utf-16 more easily be accepted?

My feeling is that to add utf-16 support directly to PCRE (in addition
to utf-8 and ascii) would not only be a lot of work, but would probably
compromise performance.

One alternative might be to make a configuration option to support
utf-16 *instead* of utf-8. Of course, that loses the utf-8 facilities,
and it would also be a lot of work.

In both those cases, I do not know how much this would affect the
newly-added JIT facilities (Zoltan wrote the code for JIT; perhaps he'll
respond as well).

A half-way house that is already on my "think about it" list would
indeed be to incur the cost of utf-16 to utf-8 conversion, but to wrap
this in a function that also builds an offset translation table so that
the returned byte offsets can be converted to halfword (16-bit unit)
offsets. This could all be hidden inside a set of utf-16-compatible
function calls, so from the caller's point of view it would look similar
to the current API. Of course there is the translation cost.

This approach has the benefit that it would involve no changes to PCRE,
and so would not impinge on future development, nor make it harder for
me and other maintainers in non-utf-16 worlds to work on the code.
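
To make the wrapper idea concrete, here is a minimal sketch of the
conversion step, under stated assumptions. The function name
utf16_to_utf8_with_map and its signature are hypothetical and are not
part of PCRE. It converts the subject to utf-8 and fills a table that
maps every utf-8 byte offset back to the halfword offset of the
character it belongs to. Unpaired surrogates are replaced by U+FFFD,
which is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE function).
   dst must have room for 3 * src_len bytes and map for 3 * src_len + 1
   entries. Returns the number of utf-8 bytes written. On return,
   map[b] is the utf-16 offset of the character containing byte b, and
   map[returned length] == src_len, so end-of-match offsets translate too. */
size_t utf16_to_utf8_with_map(const uint16_t *src, size_t src_len,
                              unsigned char *dst, size_t *map)
{
  size_t i = 0;   /* index into src, in 16-bit units */
  size_t o = 0;   /* index into dst, in bytes */

  assert(src != NULL && dst != NULL && map != NULL);

  while (i < src_len) {
    size_t start = i;      /* utf-16 offset of this character */
    size_t first = o;      /* utf-8 offset of this character */
    size_t k;
    uint32_t c = src[i++];

    if (c >= 0xD800 && c <= 0xDBFF && i < src_len &&
        src[i] >= 0xDC00 && src[i] <= 0xDFFF) {
      /* Combine a valid surrogate pair into one code point. */
      c = 0x10000 + ((c - 0xD800) << 10) + (uint32_t)(src[i] - 0xDC00);
      i++;
    } else if (c >= 0xD800 && c <= 0xDFFF) {
      c = 0xFFFD;          /* unpaired surrogate: assumed policy */
    }

    if (c < 0x80) {
      dst[o++] = (unsigned char)c;
    } else if (c < 0x800) {
      dst[o++] = (unsigned char)(0xC0 | (c >> 6));
      dst[o++] = (unsigned char)(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
      dst[o++] = (unsigned char)(0xE0 | (c >> 12));
      dst[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
      dst[o++] = (unsigned char)(0x80 | (c & 0x3F));
    } else {
      dst[o++] = (unsigned char)(0xF0 | (c >> 18));
      dst[o++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
      dst[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
      dst[o++] = (unsigned char)(0x80 | (c & 0x3F));
    }

    /* Every byte of this character maps back to the same utf-16 offset. */
    for (k = first; k < o; k++) map[k] = start;
  }

  map[o] = src_len;
  return o;
}
```

A utf-16-compatible wrapper along these lines would presumably pass dst
to the existing matching function and then replace each returned byte
offset b by map[b] before handing the result back to the caller.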

> What would such a patch need to look like?
> Would that be a compile-time option?

Both are questions that need answering. :-)

> Would it be possible to add API which returns something else than byte offsets?

Yes, as explained above, I think that could be done compatibly, now,
without affecting the rest of PCRE.

MAJOR RETHINK: if one abandons the current API, a possible way of
re-implementing PCRE would be to replace every "load character",
"advance character", and "backup character" by macros. Then one could
compile three different versions of each function (e.g.
pcre_compile_ascii, pcre_compile_utf8, pcre_compile_utf16) from the same
source code, with different macro definitions. The application would
then only load whichever one(s) it chose to use. But this is a BIG
REVOLUTION. I am now retired and I am not at all sure that I ever
could/will attempt anything on such a scale. But I thought it was worth
getting the idea on record.
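
A small sketch of what the macro idea might look like, under stated
assumptions. All names here (ADVANCE_ASCII, ADVANCE_UTF8, ADVANCE_UTF16,
DEFINE_CHAR_COUNT, char_count_*) are hypothetical, and only the "advance
character" macro is shown. To keep the example in one file, a
function-generating macro stands in for compiling the same source file
three times with different macro definitions. The input is assumed to be
well formed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "Advance character": move p past one character, never beyond end. */
#define ADVANCE_ASCII(p, end) \
  do { (void)(end); (p)++; } while (0)

#define ADVANCE_UTF8(p, end) \
  do { (p)++; \
       while ((p) < (end) && (*(p) & 0xC0) == 0x80) (p)++; } while (0)

#define ADVANCE_UTF16(p, end) \
  do { if ((*(p) & 0xFC00) == 0xD800 && (p) + 1 < (end)) (p)++; \
       (p)++; } while (0)

/* One shared function body; only the code unit type and the ADVANCE
   macro differ between the generated versions. */
#define DEFINE_CHAR_COUNT(NAME, UNIT, ADVANCE)          \
  size_t NAME(const UNIT *p, size_t len)                \
  {                                                     \
    const UNIT *end = p + len;                          \
    size_t count = 0;                                   \
    assert(p != NULL || len == 0);                      \
    while (p < end) { ADVANCE(p, end); count++; }       \
    return count;                                       \
  }

DEFINE_CHAR_COUNT(char_count_ascii, unsigned char, ADVANCE_ASCII)
DEFINE_CHAR_COUNT(char_count_utf8,  unsigned char, ADVANCE_UTF8)
DEFINE_CHAR_COUNT(char_count_utf16, uint16_t,      ADVANCE_UTF16)
```

An application would then call only the variant that matches its own
string representation, and the others need not be linked in at all.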

Philip
