[pcre-dev] [Bug 1049] Add support for UTF-16

Author: Philip Hazel
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1049] Add support for UTF-16

http://bugs.exim.org/show_bug.cgi?id=1049




--- Comment #1 from Philip Hazel <ph10@???> 2011-01-05 16:29:24 ---
A quick look at that website suggests to me that they started from PCRE, but
converted it to use UTF-16, which would be a much easier thing to do than
adding UTF-16 support while retaining support for 8-bit and UTF-8 character
strings.

The advantage of UTF-8, from PCRE's point of view, is that all the
metacharacters in patterns are ASCII one-byte characters, so they are the same
in UTF-8 and non-UTF-8 mode. Changing to UTF-16 is a whole new ball game,
because the encoding of ALL characters then changes, metacharacters included.
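
To illustrate the point about one-byte metacharacters, here is a minimal sketch in C. It is not PCRE code; the helper names and the metacharacter list are made up for the example. It relies on the UTF-8 property that every byte of a multi-byte sequence has its top bit set, so a byte-by-byte scan for ASCII metacharacters gives the same answer on plain 8-bit and UTF-8 input.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a PCRE function): returns 1 if the byte is one
   of a few common regex metacharacters. The list is illustrative only. */
static int is_meta_byte(unsigned char c)
{
    /* Guard against 0, which strchr() would match at the terminator. */
    return c != 0 && strchr("\\^$.[]|()?*+{}", c) != NULL;
}

/* Count metacharacters by scanning bytes. In UTF-8, lead and continuation
   bytes of multi-byte characters are all >= 0x80, so they can never be
   mistaken for an ASCII metacharacter. */
static size_t count_meta_bytes(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (is_meta_byte(s[i]))
            n++;
    return n;
}
```

The same scan over 16-bit units would need every comparison and every string type changed, which is the point made above.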

I imagine what WebKit did was to change to using 16-bit rather than 8-bit
character strings, and then add code to handle characters whose code points are
greater than 0xffff. This might not be too hard if one just changed "unsigned
char *" to "unsigned short int *". However, doing this removes the ability to
handle ASCII and UTF-8, which of course is not reasonable for PCRE in general.
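
As a rough idea of the extra code needed for code points greater than 0xffff in 16-bit strings, here is a minimal sketch of surrogate-pair decoding in C. It is not taken from WebKit or PCRE. The function name utf16_next is hypothetical, and returning U+FFFD for an unpaired surrogate is an assumption made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point starting at s[*pos] and
   advance *pos past it. The caller must ensure *pos < len.
   Unpaired surrogates yield U+FFFD (an assumption of this sketch). */
static uint32_t utf16_next(const uint16_t *s, size_t len, size_t *pos)
{
    uint32_t u = s[(*pos)++];

    /* A high surrogate followed by a low surrogate encodes a code point
       in the range 0x10000..0x10FFFF. */
    if (u >= 0xD800 && u <= 0xDBFF && *pos < len) {
        uint32_t lo = s[*pos];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            (*pos)++;
            return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        }
    }

    /* Any surrogate that reaches this point is unpaired. */
    if (u >= 0xD800 && u <= 0xDFFF)
        return 0xFFFD;

    return u;
}
```

For example, the two units 0xD83D 0xDE00 decode to the single code point 0x1F600, and the position advances by two.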

The Unicode FAQ has this to say: "UTF-8 is most common on the web. UTF-16 is
used by Java and Windows. UTF-32 is used by various Unix systems. The
conversions between all of them are algorithmically based, fast and lossless.
This makes it easy to support data input or output in multiple formats, while
using a particular UTF for internal storage or processing." While I will
continue to support PCRE in my retirement, I am afraid that I very much doubt
that I will undertake the (IMHO) substantial project of adding UTF-16
support. I thought I should say this, so you know where the matter stands.

