[pcre-dev] [Bug 1049] Add support for UTF-16

Author: Giuseppe D'Angelo
Date: 2011-12-30 12:05:51
To: pcre-dev
Old-Topics: [pcre-dev] [Bug 1049] New: Add support for UTF-16
Subject: [pcre-dev] [Bug 1049] Add support for UTF-16

http://bugs.exim.org/show_bug.cgi?id=1049




--- Comment #45 from Giuseppe D'Angelo <dangelog@???> 2011-12-30 12:05:51 ---
Hi,

> True, but the library itself is ready to try, and we would be really happy if
> you would give us feedback about the library (especially from those who plan to
> use it).


First of all, thank you very, very much for the work!

> With --enable-pcre8 the usual libpcre is created. Same performance, same binary
> size, it is not really changed. We hope this keep users of the 8 bit library
> happy. However, with --enable-pcre16 a new, libpcre16 library is created, which
> contains the 16 bit functions.


Just out of curiosity: which encoding do the 16-bit versions expect/support
when PCRE is built without UTF support?

> The API itself is pretty simple: every function has a 16 bit counterpart,
> starting with pcre16_ prefix. I.e: pcre_compile, pcre16_compile. That's all.
> They have the same arguments, except some char* pointers are replaced to short*
> when appropriate.
>
> Example:
> PCRE_EXP_DECL pcre *pcre_compile(const char *, int, const char **, int *,
>                   const unsigned char *);
> PCRE_EXP_DECL pcre *pcre16_compile(PCRE_SPTR16, int, const char **, int *,
>                   const unsigned char *);



Reading between the lines, am I correct in assuming that:
- one should use PCRE_UTF16 / PCRE_NO_UTF16_CHECK with the pcre16 functions
(instead of PCRE_UTF8)?
- a BOM is not handled at all -- only host endianness is supported?
- the offsets in the ovector, and the various error offsets, are in 16-bit code
units?
- the name table entry length returned by pcre16_fullinfo with
PCRE_INFO_NAMEENTRYSIZE is still in bytes, but the table itself returned by
PCRE_INFO_NAMETABLE contains 16-bit strings (as they appear in the 16-bit
pattern) and every row is terminated by a 16-bit NUL (0x0000)?
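
To make the BOM question concrete: if only host endianness is supported, I
suppose callers would have to normalise the subject themselves before handing
it to the pcre16 functions. A rough sketch of what I have in mind follows;
utf16_to_host_order is a made-up helper, not part of PCRE, and it assumes the
buffer is writable and that a missing BOM means host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE function): converts a UTF-16 buffer to
   host byte order in place. Returns the number of leading code units the
   caller should skip: 1 if a BOM was present, 0 otherwise. */
size_t utf16_to_host_order(uint16_t *buf, size_t len)
{
    if (len == 0)
        return 0;

    if (buf[0] == 0xFEFF)              /* BOM reads correctly: host order */
        return 1;

    if (buf[0] == 0xFFFE) {            /* BOM reads swapped: other endianness */
        for (size_t i = 0; i < len; i++)
            buf[i] = (uint16_t)((buf[i] << 8) | (buf[i] >> 8));
        return 1;
    }

    return 0;                          /* no BOM: assumed to be host order */
}
```

The caller would then pass buf + skip, with the length reduced by skip, as
the subject.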

> PCRE_SPTR16 is const short *


Why not use an unsigned short here?
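
The concern is that code units at or above 0x8000 come out negative when read
through a signed type, so range checks written by callers can silently fail.
A small sketch of the problem, assuming a 16-bit short (I use int16_t and
uint16_t to make that explicit; the function names are only for illustration):

```c
#include <stdint.h>

/* Returns 1 if c is a UTF-16 high (leading) surrogate, 0 otherwise. */
int is_high_surrogate_unsigned(uint16_t c)
{
    return c >= 0xD800 && c <= 0xDBFF;
}

/* Same check on a signed unit. c is promoted to int with its sign kept,
   so its value is at most 0x7FFF and the test can never succeed. */
int is_high_surrogate_signed(int16_t c)
{
    return c >= 0xD800 && c <= 0xDBFF;
}

/* Workaround with a signed pointer type: convert each unit to unsigned
   before comparing (signed-to-unsigned conversion is well defined). */
int is_high_surrogate_signed_fixed(int16_t c)
{
    return is_high_surrogate_unsigned((uint16_t)c);
}
```

With an unsigned pointer type the cast in the last function would not be
needed at every comparison.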

Thank you again,
Giuseppe D'Angelo

