[pcre-dev] [Bug 1049] Add support for UTF-16

Author: Philip Hazel
Date: 2011-11-14 12:05:13
To: pcre-dev
Old-Topics: [pcre-dev] [Bug 1049] New: Add support for UTF-16
Subject: [pcre-dev] [Bug 1049] Add support for UTF-16

http://bugs.exim.org/show_bug.cgi?id=1049




--- Comment #15 from Philip Hazel <ph10@???> 2011-11-14 12:05:13 ---
On Mon, 14 Nov 2011, Zoltan Herczeg wrote:

> - We should choose between the two modes at configure time. Configure can be
> called from a different build directory, so creating both libraries from the
> same source is no trouble (I usually do this because I don't like mixing source
> and object files anyway).


I don't mind doing it that way. It would be simpler than trying to build
both libraries at once.

> - pcre.h: We should provide a pcre16.h for the 16 bit library. The two headers
> shouldn't be included at the same time (although different C files can include
> different ones). Perhaps this would not cause any trouble, but who knows.


That worries me just a little bit. With suitable use of C macros, it
should be possible to allow both to be included. pcre.h already tests
_PCRE_H so that it is safe to include it twice; I'm sure we can do
something similar.
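
A minimal sketch of what I mean, not an agreed design. The _PCRE_H guard
and the pcre typedef are what pcre.h has now; the _PCRE16_H guard and the
pcre16 type name are assumptions made up for illustration.

```c
/* ---- the guard pcre.h already uses ---- */
#ifndef _PCRE_H
#define _PCRE_H
struct real_pcre;                    /* opaque compiled 8-bit pattern */
typedef struct real_pcre pcre;
#endif /* _PCRE_H */

/* ---- hypothetical pcre16.h: same idea, different guard name ---- */
#ifndef _PCRE16_H
#define _PCRE16_H
struct real_pcre16;                  /* hypothetical opaque 16-bit pattern */
typedef struct real_pcre16 pcre16;   /* distinct type, so no name clash */
#endif /* _PCRE16_H */
```

Because each header has its own guard and declares its own type names, a
source file could include either one, both, or the same one twice.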

> - LINK_SIZE 3 would be the same as LINK_SIZE 4 in 16 bit mode.


As the code will be working in 16-bit quantities, we will, internally,
have to work with LINK_SIZE/2. I agree that making 3=4 is probably more
useful than faulting it.
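
As a rough sketch of the arithmetic (the function name is hypothetical,
and rounding up is just one way of making 3 behave like 4):

```c
/* Number of 16-bit units needed to hold a link of link_size bytes.
   Rounding up means LINK_SIZE 3 gets the same two units as LINK_SIZE 4. */
static int sketch_link_units16(int link_size)
{
    return (link_size + 1) / 2;
}
```

With this, LINK_SIZE 2 gives one unit, and LINK_SIZE 3 and 4 both give two.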

> - PCRE_UTF8 would be replaced by PCRE_UTF16 in pcre16.h. (No need to allocate a
> new flag)


I would like to allocate a new flag so that we can detect a 16-bit
pattern that is erroneously passed to an 8-bit matcher. (I just have the
feeling that somebody is sure to come up with an application that
handles both sizes.) PCRE is short of option bits at the moment; however
there are plenty in its private flags field, so 8 or 16 could be
remembered there.
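
Something along these lines is what I have in mind. It is only a sketch:
the structure, the flag names and values, and the error code are all
hypothetical, not existing PCRE definitions.

```c
#include <stdint.h>

/* Hypothetical private-flag bits recording the code unit width. */
#define SKETCH_MODE8          0x0001u
#define SKETCH_MODE16         0x0002u
#define SKETCH_ERROR_BADMODE  (-1)     /* hypothetical error code */

/* Stand-in for the compiled pattern block; only the field used here. */
typedef struct {
    uint16_t flags;                    /* private flags field */
} sketch_pattern;

/* Return 0 if the pattern was compiled for the mode the matcher wants,
   otherwise the "bad mode" error. */
static int sketch_check_mode(const sketch_pattern *re, unsigned wanted_mode)
{
    return (re->flags & wanted_mode) ? 0 : SKETCH_ERROR_BADMODE;
}
```

The 8-bit matcher would call this with the 8-bit bit and the 16-bit
matcher with the 16-bit bit, before looking at anything else in the block.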

> I have a solution for the reload ability as well: we could provide a conversion
> function between different endianness (although this would be a low priority
> task at the moment). Currently it would be enough to mark the endianness with a
> flag and pcre_exec(...) would simply return with an error in case of a bad
> endianness.


That would work. Note that there is already code to cope with the
endianness of the data in the real_pcre block.
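
A sketch of how such a check can work, assuming the block starts with a
known 32-bit magic number. The names and the constant here are
illustrative, not copied from the PCRE source.

```c
#include <stdint.h>

#define SKETCH_MAGIC 0x50435245u       /* "PCRE" in ASCII, for illustration */

/* Reverse the byte order of a 32-bit value. */
static uint32_t sketch_swap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) |
           ((v & 0x0000ff00u) << 8)  |
           ((v & 0x00ff0000u) >> 8)  |
           ((v & 0xff000000u) >> 24);
}

/* Classify a magic number as read from a reloaded pattern:
   0 = native byte order, 1 = opposite endianness, -1 = not a pattern. */
static int sketch_byte_order(uint32_t magic_as_read)
{
    if (magic_as_read == SKETCH_MAGIC) return 0;
    if (sketch_swap32(magic_as_read) == SKETCH_MAGIC) return 1;
    return -1;
}
```

In the "opposite endianness" case the matcher could either return an
error, as you suggest, or hand the block to a conversion function.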

> By the way, does anyone know a clever way of compile time endianness check?


Don't you just set up an int and then cast a pointer to it as char * and
then look at the bytes?
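
Roughly like this (the function name is made up). Note that this is a
run-time test; for an answer at configure time, autoconf's AC_C_BIGENDIAN
macro, which defines WORDS_BIGENDIAN on big-endian hosts, is one option.

```c
/* Return 1 on a little-endian host, 0 otherwise, by looking at which
   byte of an integer holding the value 1 comes first in memory. */
static int sketch_is_little_endian(void)
{
    unsigned int probe = 1;
    const unsigned char *first_byte = (const unsigned char *)&probe;
    return *first_byte == 1;
}
```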

> Philip, I would like to help you in this work.


Thank you. That will be helpful. We will need to do some planning and
I guess we should not fill up the bugzilla with all of that. I'll mail
you privately.

Philip

