[pcre-dev] [Bug 1049] Add support for UTF-16

Author: Zoltan Herczeg
Date:  
To: pcre-dev
Old-Topics: [pcre-dev] [Bug 1049] New: Add support for UTF-16
Subject: [pcre-dev] [Bug 1049] Add support for UTF-16

http://bugs.exim.org/show_bug.cgi?id=1049

Zoltan Herczeg <hzmester@???> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hzmester@???





--- Comment #6 from Zoltan Herczeg <hzmester@???> 2011-11-09 23:23:16 ---
Thank you for considering PCRE as a replacement. I am involved in QtWebKit
development, and I personally like Qt very much.

UTF-16 support is an important requirement, since the encoding is quite
widespread now. In the long run I feel supporting it is unavoidable, since
conversion is expensive (it involves a lot of memory reads and writes),
especially on long inputs.

> In both those cases, I do not know how much this would affect the
> newly-added JIT facilities (Zoltan wrote the code for JIT; perhaps he'll
> respond as well).


From the JIT point of view, this is quite an easy task, since it would only
affect the compiler (not the compiled machine code). I tried to design the
character handling (read, peek, skip, ...) to be modular, so it should be
fairly easy to support any character type (although some code tidying is
needed).
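
To make the "modular character handling" point concrete, here is a minimal
sketch of what a per-encoding "read" primitive could look like. This is not the
PCRE or JIT code: the names read_char_utf8 and read_char_utf16 are hypothetical,
and the sketch assumes the input is already well formed (no validation, no
end-of-buffer checks).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE function): decode one code point from
   well-formed UTF-8 and return the number of 8-bit units consumed. */
static size_t read_char_utf8(const uint8_t *p, uint32_t *c)
{
  assert(p != NULL && c != NULL);
  if (p[0] < 0x80) {                      /* 1 unit: ASCII */
    *c = p[0];
    return 1;
  }
  if ((p[0] & 0xE0) == 0xC0) {            /* 2 units */
    *c = ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
    return 2;
  }
  if ((p[0] & 0xF0) == 0xE0) {            /* 3 units */
    *c = ((uint32_t)(p[0] & 0x0F) << 12) |
         ((uint32_t)(p[1] & 0x3F) << 6) |
         (uint32_t)(p[2] & 0x3F);
    return 3;
  }
  /* 4 units (assumed, since the input is taken to be well formed) */
  *c = ((uint32_t)(p[0] & 0x07) << 18) |
       ((uint32_t)(p[1] & 0x3F) << 12) |
       ((uint32_t)(p[2] & 0x3F) << 6) |
       (uint32_t)(p[3] & 0x3F);
  return 4;
}

/* Hypothetical helper: the same operation for well-formed UTF-16.
   Returns the number of 16-bit units consumed. */
static size_t read_char_utf16(const uint16_t *p, uint32_t *c)
{
  assert(p != NULL && c != NULL);
  if (p[0] >= 0xD800 && p[0] <= 0xDBFF) { /* high surrogate: pair follows */
    *c = 0x10000 + (((uint32_t)(p[0] - 0xD800) << 10) |
                    (uint32_t)(p[1] - 0xDC00));
    return 2;
  }
  *c = p[0];                              /* single unit (BMP) */
  return 1;
}
```

Both helpers have the same shape (decode, report how far to advance), which is
what would let the rest of the matcher stay independent of the unit width.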

> MAJOR RETHINK: if one abandons the current API, a possible way of
> re-implementing PCRE would be to replace every "load character",
> "advance character", and "backup character" by macros. Then one could
> compile three different versions of each function (e.g.
> pcre_compile_ascii, pcre_compile_utf8, pcre_compile_utf16) from the same
> source code, with different macro definitions. The application would
> then only load whichever one(s) it chose to use. But this is a BIG
> REVOLUTION. I am now retired and I am not at all sure that I ever
> could/will attempt anything on such a scale. But I thought it was worth
> getting the idea on record.


I had been thinking about this before, and I reached exactly the same
conclusions, except for the new API. Wow, I am really surprised now!

The code would look something like this:

There would be a "pcre_exec_internal.c" containing all three variants (normal,
UTF-8, UTF-16), separated by #ifdefs.

pcre_exec would look like:

#ifdef UTF8_SUPPORT
#define UTF8_MODE
/* defines 'static int pcre_exec_utf8(...)' */
#include "pcre_exec_internal.c"
#undef UTF8_MODE
#endif /* UTF8_SUPPORT */

#ifdef UTF16_SUPPORT
#define UTF16_MODE
/* defines 'static int pcre_exec_utf16(...)' */
#include "pcre_exec_internal.c"
#undef UTF16_MODE
#endif /* UTF16_SUPPORT */

/* Neither mode macro is defined here, so this
   defines 'static int pcre_exec_ascii(...)' */
#include "pcre_exec_internal.c"

int pcre_exec(...)
{
  /* re - points to the pcre byte code. */
#ifdef UTF8_SUPPORT
  if ((re->flags & UTF8) != 0) {
    /* Static and called from one place only, so gcc can inline it. */
    return pcre_exec_utf8(...);
  }
#endif /* UTF8_SUPPORT */

#ifdef UTF16_SUPPORT
  if ((re->flags & UTF16) != 0) {
    /* Static and called from one place only, so gcc can inline it. */
    return pcre_exec_utf16(...);
  }
#endif /* UTF16_SUPPORT */

  /* Static and called from one place only, so gcc can inline it. */
  return pcre_exec_ascii(...);
}
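
For anyone who wants to try the idea without touching PCRE, below is a
self-contained sketch of the same "one body, several compiled variants, one
dispatcher" pattern. To keep it in a single file it swaps the repeated #include
for a function-generating macro, and it uses a toy operation (counting
characters) in place of matching. All names here (DEFINE_COUNT_CHARS,
count_chars, FLAG_UTF8, FLAG_UTF16) are made up for the illustration and are
not part of any PCRE API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shared body, instantiated once per mode. A character is counted at each
   code unit that does not continue a previous character. */
#define DEFINE_COUNT_CHARS(NAME, UNIT_T, IS_CONTINUATION)        \
  static size_t NAME(const UNIT_T *s, size_t units)              \
  {                                                              \
    size_t i, n = 0;                                             \
    assert(s != NULL || units == 0);                             \
    for (i = 0; i < units; i++)                                  \
      if (!IS_CONTINUATION(s[i])) n++;                           \
    return n;                                                    \
  }

/* Per-mode "is this unit the continuation of a character?" tests. */
#define IS_NEVER(u)     0
#define IS_UTF8_CONT(u) (((u) & 0xC0) == 0x80)
#define IS_UTF16_LOW(u) ((u) >= 0xDC00 && (u) <= 0xDFFF)

DEFINE_COUNT_CHARS(count_chars_ascii, uint8_t,  IS_NEVER)
DEFINE_COUNT_CHARS(count_chars_utf8,  uint8_t,  IS_UTF8_CONT)
DEFINE_COUNT_CHARS(count_chars_utf16, uint16_t, IS_UTF16_LOW)

/* Hypothetical mode flags, standing in for the flags stored in the
   compiled pattern. */
enum { FLAG_UTF8 = 1, FLAG_UTF16 = 2 };

/* Single public entry point that picks the variant at run time. */
static size_t count_chars(const void *s, size_t units, int flags)
{
  if ((flags & FLAG_UTF8) != 0)
    return count_chars_utf8((const uint8_t *)s, units);
  if ((flags & FLAG_UTF16) != 0)
    return count_chars_utf16((const uint16_t *)s, units);
  return count_chars_ascii((const uint8_t *)s, units);
}
```

The mode test happens once per call in the dispatcher; inside each variant the
per-character test is fixed at compile time, which is the point of building
separate variants from one source.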

