[pcre-dev] [Bug 1049] Add support for UTF-16

Author: Philip Hazel
Date:  
To: pcre-dev
Old-Topics: [pcre-dev] [Bug 1049] New: Add support for UTF-16
Subject: [pcre-dev] [Bug 1049] Add support for UTF-16

http://bugs.exim.org/show_bug.cgi?id=1049

--- Comment #8 from Philip Hazel <ph10@???> 2011-11-11 15:33:09 ---
On Wed, 9 Nov 2011, Zoltan Herczeg wrote:

> int pcre_exec(...)
> {
> /* re - points to the pcre byte code. */
> #ifdef UTF8_SUPPORT
> if ((re->flags & UTF8) != 0) {
> /* Since this is a static function, gcc will inline it. */
> return pcre_exec_utf8(...);
> }
> #endif /* UTF8_SUPPORT */
>
> #ifdef UTF16_SUPPORT
> if ((re->flags & UTF16) != 0) {
> /* Since this is a static function, gcc will inline it. */
> return pcre_exec_utf16(...);
> }
> #endif /* UTF16_SUPPORT */
>
> /* Since this is a static function, gcc will inline it. */
> return pcre_exec_ascii(...);
> }


The problem with that is that an application that uses only one kind of
string, but which links statically with a library that supports all
three types of string, will drag in all three functions, and so be far
bigger than it need be. (This is already a small worry for me with the
JIT code in applications that do not use it.) There were people who
linked PCRE statically in memory-poor environments (they have posted or
emailed from time to time) and I assume such uses still exist.

A separate pcre_compile_16 etc. avoids this issue, and that is why I
suggested being drastic, and making the application call one of three
functions: pcre_compile_ascii, pcre_compile_utf8, or pcre_compile_utf16,
deliberately removing the current name. (Also pcre_compile_wchar if that
is felt to be useful as well.) And likewise for study and exec and the
get_xxx functions ... it all gets very messy, but by using macros and
ifdefs it might be possible to keep the sources fairly clean.
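
As a rough illustration of the macro approach described above, the sketch
below stamps out one function per string type from a single body by token
pasting. All names here (DEMO_NAME, DEMO_DEFINE_LENGTH, demo_length_*) are
hypothetical and are not part of any PCRE release; the function merely
counts code units up to a terminating zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper macros: two levels so that arguments are expanded
   before being pasted together with ##. */
#define DEMO_PASTE(a, b) a##b
#define DEMO_NAME(base, suffix) DEMO_PASTE(base, suffix)

/* One function body, instantiated once per code-unit type. The generated
   function counts code units before the terminating zero unit. */
#define DEMO_DEFINE_LENGTH(suffix, unit_type)                  \
  size_t DEMO_NAME(demo_length_, suffix)(const unit_type *s)   \
  {                                                            \
    size_t n = 0;                                              \
    assert(s != NULL);                                         \
    while (s[n] != 0) n++;                                     \
    return n;                                                  \
  }

/* Three separately named entry points; there is no common dispatcher
   that refers to all of them. */
DEMO_DEFINE_LENGTH(ascii, char)
DEMO_DEFINE_LENGTH(utf8, uint8_t)
DEMO_DEFINE_LENGTH(utf16, uint16_t)
```

For the static-linking benefit to appear, each variant would presumably
have to live in its own object file (for example, one shared source file
compiled several times with different -D settings), so that the linker
pulls in only the variants an application actually calls.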

If a program wants to use two different kinds of string (as pcretest and
pcregrep already do), this would not be a problem. There would be a
small execution performance benefit because the existing runtime tests
for utf8 would be removed.
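
To make the point about runtime tests concrete, here is a minimal sketch
that contrasts a single entry point, which checks a mode flag inside its
loop, with two separate entry points whose mode is fixed by the function
name. DEMO_FLAG_UTF8 and the demo_count_chars* functions are invented for
this example and are not PCRE identifiers; the UTF-8 versions assume valid
input.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_FLAG_UTF8 0x01u  /* hypothetical flag, not a PCRE constant */

/* Single entry point: the mode flag is tested for every byte at run time. */
size_t demo_count_chars(const unsigned char *s, size_t len, unsigned flags)
{
  size_t n = 0;
  assert(s != NULL);
  for (size_t i = 0; i < len; i++) {
    if ((flags & DEMO_FLAG_UTF8) != 0 && (s[i] & 0xC0) == 0x80)
      continue;  /* UTF-8 continuation byte: belongs to the previous character */
    n++;
  }
  return n;
}

/* Separate entry points: no flag and no per-byte mode test. */
size_t demo_count_chars_ascii(const unsigned char *s, size_t len)
{
  assert(s != NULL);
  (void)s;       /* every byte is one character */
  return len;
}

size_t demo_count_chars_utf8(const unsigned char *s, size_t len)
{
  size_t n = 0;
  assert(s != NULL);
  for (size_t i = 0; i < len; i++) {
    if ((s[i] & 0xC0) != 0x80)  /* count only lead bytes and ASCII bytes */
      n++;
  }
  return n;
}
```

A program that needs both kinds of string simply calls both of the
separately named functions; nothing forces it to go through a common
dispatcher.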

I think, if this ever happens, we should probably change the name to
NPCRE and release it as a separate library (compare ncurses). The whole
API should be carefully designed and discussed - there are probably
other rough edges that could be tidied up - before actually implementing
anything.

Philip

