[pcre-dev] [Bug 1049] Add support for UTF-16

Author: Philip Hazel
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1049] Add support for UTF-16

http://bugs.exim.org/show_bug.cgi?id=1049




--- Comment #2 from Philip Hazel <ph10@???> 2011-01-12 16:43:21 ---
I have been advised by a WebKit developer that WebKit just uses 16-bit
characters, with no support for code points > 0xFFFF. So they do not in fact
use UTF-16, which *does* support code points > 0xFFFF (by means of surrogate
pairs).
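
To illustrate the distinction, here is a minimal sketch of how UTF-16 represents a code point above 0xFFFF as two 16-bit units. The function name to_utf16_units is hypothetical and chosen only for this example; it is not part of PCRE or WebKit.

```python
def to_utf16_units(cp):
    """Return the UTF-16 code units (as a list of ints) for one code point."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % (cp,))
    if cp <= 0xFFFF:
        # Fits in a single 16-bit unit.
        return [cp]
    # Above 0xFFFF: split the 20-bit offset into a surrogate pair.
    v = cp - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


# U+1D11E (MUSICAL SYMBOL G CLEF) needs two 16-bit units in UTF-16.
pair = to_utf16_units(0x1D11E)
```

Code that treats each 16-bit unit as a complete character would see the two units in `pair` as two separate characters, which is why 16-bit characters alone do not amount to UTF-16.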

One thing that might be worth doing (by me or by someone else) is to write a
function that translates from UTF-16 to UTF-8 and at the same time builds an
index from offsets in the UTF-8 string of bytes to offsets in the UTF-16 string
of 16-bit quantities. Using this, one could then call PCRE and quickly
translate any offsets that it returns into offsets in the original UTF-16 data.
It will still be a bit slow, of course, but I think this is more likely to
happen than any "native" support for UTF-16.
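
The translation function described above might look roughly like the following sketch. It is only an illustration under stated assumptions: the name utf16_to_utf8_with_index is hypothetical, the index is a dense list with one entry per UTF-8 byte (plus one for the end of the string), and Python's re module on bytes stands in for a call to PCRE.

```python
import re


def utf16_to_utf8_with_index(units):
    """Translate UTF-16 code units to UTF-8 and build an offset index.

    Returns (utf8, index), where index[b] is the offset, in 16-bit units,
    of the character that contains UTF-8 byte b. index[len(utf8)] is the
    end offset, so match end positions can be translated as well.
    """
    out = bytearray()
    index = []
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        width = 1
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError("unpaired high surrogate at unit %d" % i)
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate at unit %d" % i)
        else:
            cp = u
        encoded = chr(cp).encode("utf-8")
        out += encoded
        # Every byte of this character maps back to its first UTF-16 unit.
        index.extend([i] * len(encoded))
        i += width
    index.append(n)
    return bytes(out), index


# "a", U+20AC, U+1D11E (a surrogate pair), "b" as UTF-16 code units.
units = [0x0061, 0x20AC, 0xD834, 0xDD1E, 0x0062]
utf8, index = utf16_to_utf8_with_index(units)

# Stand-in for a PCRE call: match on the UTF-8 bytes, then translate the
# returned byte offsets into offsets in the original UTF-16 data.
m = re.search("\U0001D11Eb".encode("utf-8"), utf8)
start16, end16 = index[m.start()], index[m.end()]
```

A dense index costs one entry per UTF-8 byte but makes each offset translation a single lookup; a sparser index with a binary search would trade lookup speed for memory.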

