Re: [pcre-dev] [Bug 1049] Add support for UTF-16

Author: Philip Hazel
Date:  
To: 1049
CC: pcre-dev
Subject: Re: [pcre-dev] [Bug 1049] Add support for UTF-16
On Sat, 12 Nov 2011, Zoltan Herczeg wrote:

> > ME: I think PCRE should only do the native endian of the machine
> > that it's compiled on. x86 happens to be little endian. If any
> > cross-endian support is needed then it should be left to the
> > application program to swap the data byte pairs.


There is also the issue of saving/restoring compiled patterns, which
PCRE currently supports. Because the bytecode works entirely in bytes,
the endian issues are confined to the pattern's data structure, not to
the compiled bytecode. Support for 16-bit quantities will affect
this. If we go for a major new API, support for saving/restoring could
be dropped, of course. I don't know how many people actually use this
feature.
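
As a rough illustration of the application-side "swap the data byte
pairs" step quoted above, here is a minimal sketch in C. It is not PCRE
code: the function name swap_utf16_bytes is hypothetical, and it assumes
a C99 compiler that provides uint16_t in <stdint.h>.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE: swap the two bytes of every
   16-bit unit in place, converting a buffer between big-endian and
   little-endian order. Applying it twice restores the original data. */
static void swap_utf16_bytes(uint16_t *units, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        /* The cast discards the bits shifted above bit 15. */
        units[i] = (uint16_t)((units[i] << 8) | (units[i] >> 8));
    }
}
```

An application holding subject data in the non-native byte order would
call this on its buffer before handing the data to the library, so the
library itself only ever sees native-endian 16-bit units.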

> Philip prefers a different approach and I support his idea. We should extend
> the API with individual compile/study/exec/utility functions for each input
> type. The question is how to do it: an individual library for each mode, or some
> creative use of the C preprocessor (I would prefer this one but I am not
> against the other).


I think we need the C preprocessor in both cases. :-)

> With the utility above. Does C have a wchar version of common utilities like
> gets?


For the last 20 years I have been working from a copy of the C90
Standard. Maybe it's time I took a look at C99. :-) I see that Wikipedia
has some information, but says this:

However, the ISO/IEC 10646:2003 Unicode standard 4.0 says that:

"The width of wchar_t is compiler-specific and can be as small as 8
bits. Consequently, programs that need to be portable across any C or
C++ compiler should not use wchar_t for storing Unicode text. The
wchar_t type is intended for storing compiler-defined wide characters,
which may be Unicode characters in some compilers."

It does suggest that there are functions such as wprintf(). It also
points out that wide literal strings have to be prefixed with L, e.g.
L"something".

My feeling is that wchar_t was always a fudge and is probably best
avoided. Instead, we should make use of an explicit 16-bit character
type for UTF-16.
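
To make the idea of an explicit 16-bit type concrete, here is a small
sketch in C. The names utf16_unit and utf16_count_chars are hypothetical
and are not taken from PCRE. The sketch assumes C99's uint16_t; a C90
build would have to choose a suitable 16-bit unsigned type some other
way, for example at configure time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type name: exactly 16 bits wherever uint16_t exists,
   unlike wchar_t, whose width varies between compilers. */
typedef uint16_t utf16_unit;

/* Count the characters in a zero-terminated UTF-16 string. A high
   surrogate (0xD800-0xDBFF) followed by a low surrogate
   (0xDC00-0xDFFF) counts as one character; any other unit,
   including an unpaired surrogate, counts as one on its own. */
static size_t utf16_count_chars(const utf16_unit *s)
{
    size_t n = 0;
    while (*s != 0) {
        /* s[1] is safe to read: *s is non-zero, so the terminator
           is at s[1] or later. */
        if (*s >= 0xD800 && *s <= 0xDBFF &&
            s[1] >= 0xDC00 && s[1] <= 0xDFFF)
            s += 2;
        else
            s += 1;
        n++;
    }
    return n;
}
```

Because the unit width is fixed, the surrogate arithmetic behaves the
same on every platform, which is not guaranteed for code written in
terms of wchar_t.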

> Your feedback is very welcome. We could learn a lot from it.


Indeed. I agree with this absolutely. Thank you.

> I think PCRE is the best library out there, and if we could add this feature
> somehow, it would be perfect.


Thank you for that.

NOW: there's been a lot of useful discussion in the last few days. I
think I will go away and write a document and try to summarize the
various points and issues so that they are all brought together. This
will help me (and I hope others) figure out where we should go next.

I will post (a link to) the document when it is done. May take a little
while. Writing text always does.

Philip

--
Philip Hazel