[pcre-dev] [Bug 1049] Add support for UTF-16

Author: Zoltan Herczeg
Date:  
To: pcre-dev
Old-Topics: [pcre-dev] [Bug 1049] New: Add support for UTF-16
Subject: [pcre-dev] [Bug 1049] Add support for UTF-16

http://bugs.exim.org/show_bug.cgi?id=1049




--- Comment #44 from Zoltan Herczeg <hzmester@???> 2011-12-29 19:02:01 ---
> ... but please note that the documentation has not been touched at all.
> That is my next big job.


True, but the library itself is ready to try, and we would be really happy if
you would give us feedback about the library (especially from those who plan to
use it).

Let me give you a short guide to the 16 bit PCRE library.

After you check out PCRE you need to create the configure script first. The
following two commands do this (run the second one inside the checked-out
pcre directory):
svn co svn://vcs.exim.org/pcre/code/trunk pcre
libtoolize -c -f && aclocal && autoheader && automake -a -c && autoconf

Here are the new options added to configure:
  --enable-pcre8 / --disable-pcre8 (enabled by default)
  --enable-pcre16 / --disable-pcre16 (disabled by default)
  --enable-utf / --disable-utf (disabled by default)
       replaces --enable-utf8, which is obsolete from
       now on, although it is kept for compatibility for some time.
       The value set for --enable/disable-utf8 is simply
       copied to --enable/disable-utf


With --enable-pcre8 the usual libpcre is created. Same performance, same binary
size, it is not really changed. We hope this keeps users of the 8 bit library
happy. With --enable-pcre16, however, a new library, libpcre16, is created,
which contains the 16 bit functions.

About the new API:
We still have a single pcre.h, which contains the forward declarations of both
the 8 and 16 bit functions. This is not an issue in C; just don't use functions
from a library which is not linked to your application (with -lpcre or
-lpcre16).

The API itself is pretty simple: every function has a 16 bit counterpart with
the pcre16_ prefix, e.g. pcre_compile and pcre16_compile. That's all. They take
the same arguments, except that some char* pointers are replaced with short*
where appropriate.

Example:
PCRE_EXP_DECL pcre *pcre_compile(const char *, int, const char **, int *,
                  const unsigned char *);
PCRE_EXP_DECL pcre *pcre16_compile(PCRE_SPTR16, int, const char **, int *,
                  const unsigned char *);


PCRE_SPTR16 is const short *
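
Since the 16 bit functions take 16 bit code units instead of char, an ordinary
string literal has to be widened before it can be passed as a PCRE_SPTR16. A
minimal sketch of that step, assuming ASCII-only input: the names unit16 and
widen_ascii are hypothetical and not part of PCRE, and the unit type simply
mirrors the "const short *" description above (check pcre.h for the real
definition).

```c
#include <stddef.h>

/* Assumption: 16 bit code unit type, mirroring the "const short *"
   description of PCRE_SPTR16 in this message. Not taken from pcre.h. */
typedef short unit16;

/* Hypothetical helper: widen a NUL-terminated ASCII string into 16 bit units.
   Returns the number of units written (excluding the terminator), or
   (size_t)-1 if the buffer is too small or a non-ASCII byte is found. */
size_t widen_ascii(const char *src, unit16 *dst, size_t dst_len)
{
    size_t i;

    for (i = 0; src[i] != '\0'; i++) {
        /* Keep one slot free for the terminating zero unit. */
        if (i + 1 >= dst_len || (unsigned char)src[i] > 0x7F)
            return (size_t)-1;
        dst[i] = (unit16)src[i];
    }
    if (dst_len == 0)
        return (size_t)-1;
    dst[i] = 0;
    return i;
}
```

The filled buffer is what would be handed to pcre16_compile in place of the
char* pattern. Anything outside ASCII needs a proper conversion to 16 bit
units, which this sketch deliberately rejects.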

Warning: do not mix the 8 and 16 bit APIs! For example, the value returned by
pcre16_study must be freed by pcre16_free_study! Segfaults will occur if you
use pcre_free_study! Also, use pcre16_free for freeing the compiled regex
returned by pcre16_compile, not pcre_free. Keep in mind that all functions and
static variables are duplicated!
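
One possible way to guard against such mix-ups in an application that links
both libraries is to remember which library produced a value and release it
through the matching routine. The following is only a sketch of that idea:
study_handle, study_release and the fake_free_study_* functions are
hypothetical stand-ins so that it compiles without pcre.h. In real code the
two branches would call pcre_free_study and pcre16_free_study.

```c
#include <stddef.h>

/* Which library produced a handle. */
typedef enum { WIDTH_8, WIDTH_16 } lib_width;

/* Hypothetical wrapper: a study result plus the width that allocated it. */
typedef struct {
    lib_width width;
    void *extra;   /* would hold the value returned by pcre_study/pcre16_study */
} study_handle;

/* Stand-ins for pcre_free_study and pcre16_free_study; they only count
   calls so the dispatch can be observed. */
static int freed_8 = 0, freed_16 = 0;
static void fake_free_study_8(void *p)  { (void)p; freed_8++; }
static void fake_free_study_16(void *p) { (void)p; freed_16++; }

/* Release a handle through the free routine of the library that created it. */
void study_release(study_handle *h)
{
    if (h == NULL || h->extra == NULL)
        return;
    if (h->width == WIDTH_16)
        fake_free_study_16(h->extra);
    else
        fake_free_study_8(h->extra);
    h->extra = NULL;   /* guard against a second release */
}
```

The same tagging could be applied to the compiled regex, so that pcre16_free
and pcre_free are never swapped either.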

One more thing: --disable-pcre8 also disables the POSIX API, pcregrep and
pcrecpp. These have no 16 bit counterparts, and none are planned at the moment
unless someone needs them (especially for the POSIX API, since the POSIX
standard has no 16 bit API for regex).

However, --enable-utf, --enable-unicode-properties and --enable-jit are
supported by both the 8 and 16 bit libraries.

Updating the CMake build system is still an open task; we would be really happy
if someone could help us!

