[pcre-dev] [Bug 1049] Add support for UTF-16

Author: Zoltan Herczeg
Date:  
To: pcre-dev
Old Topics: [pcre-dev] [Bug 1049] New: Add support for UTF-16
Subject: [pcre-dev] [Bug 1049] Add support for UTF-16
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1049




--- Comment #46 from Zoltan Herczeg <hzmester@???> 2011-12-30 13:45:20 ---
Thank you for the feedback, Giuseppe.

> Just out of curiosity, but which encoding do the 16 bit versions expect/support
> when PCRE is built without UTF support?


The same 8-bit character tables as before. Characters above 255 have no other
case and no type (16-bit tables would be too big), so you need to use []
ranges to select the characters you need.

> Reading between the lines, am I correct when I assume that:
> - should use PCRE_UTF16 / PCRE_NO_UTF16_CHECK with the pcre16 functions
> (instead of PCRE_UTF8)?


Exactly.

> - BOM is not handled at all -- only host endianness is supported?


True again. However, we provide a utility function called
pcre16_utf16_to_host_byte_order, which can convert the input to host byte
order and optionally remove BOMs during the conversion.
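
To illustrate what such a conversion involves, here is a minimal sketch in
plain C. It is not the PCRE implementation and does not reproduce the
signature of pcre16_utf16_to_host_byte_order; the function name
utf16_to_host_order and its parameters are hypothetical. It assumes that a
unit read as 0xFEFF means "same byte order as the host" and a unit read as
0xFFFE means "opposite byte order".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE.
 * Copies `len` UTF-16 code units from `in` to `out`, byte-swapping every
 * unit that follows a reversed BOM (read as 0xFFFE). BOMs are written as
 * 0xFEFF when `keep_boms` is non-zero and dropped otherwise.
 * Returns the number of code units written to `out`. */
size_t utf16_to_host_order(uint16_t *out, const uint16_t *in,
                           size_t len, int keep_boms)
{
    int host_order = 1;   /* assume host byte order until a BOM says otherwise */
    size_t written = 0;

    assert(out != NULL && in != NULL);

    for (size_t i = 0; i < len; i++) {
        uint16_t c = in[i];

        if (c == 0xFEFF || c == 0xFFFE) {
            /* A BOM tells us whether the following units need swapping. */
            host_order = (c == 0xFEFF);
            if (keep_boms)
                out[written++] = 0xFEFF;
        } else if (host_order) {
            out[written++] = c;
        } else {
            /* Swap the two bytes of the code unit. */
            out[written++] = (uint16_t)((c >> 8) | (c << 8));
        }
    }
    return written;
}
```

Because the sketch only compares code unit values, it never needs to detect
the byte order of the machine itself; it is enough to know whether the input
agrees with the host or not.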

> - the offsets in the ovector, and the various error offsets, are in 16 bit code
> units?


Your guess is right, again. However, the error message strings are still 8 bit!
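
Since offsets count 16-bit code units rather than characters, a caller who
wants a character index has to account for surrogate pairs. The following is
only a sketch of that arithmetic, assuming well-formed UTF-16 input; the
helper name utf16_units_to_chars is hypothetical and not part of PCRE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE.
 * Returns how many characters (code points) are covered by the first
 * `unit_offset` code units of `s`. Assumes `s` is well-formed UTF-16. */
size_t utf16_units_to_chars(const uint16_t *s, size_t unit_offset)
{
    size_t chars = 0;

    assert(s != NULL);

    for (size_t i = 0; i < unit_offset; i++) {
        /* A low (trailing) surrogate continues the previous character,
         * so only the other code units start a new one. */
        if (s[i] < 0xDC00 || s[i] > 0xDFFF)
            chars++;
    }
    return chars;
}
```

For the subject "a", U+1D11E, "b" (code units 0x0061 0xD834 0xDD1E 0x0062),
a match on "b" starts at code unit offset 3, which is character index 2.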

> - the name table entry length returned by pcre16_fullinfo with
> PCRE_INFO_NAMEENTRYSIZE is still in bytes, but the table itself returned by
> PCRE_INFO_NAMETABLE contains 16 bit strings (as they appear in the 16 bit
> pattern) and every row is terminated by a 16 bit NUL (0x0000)?


No. PCRE_INFO_NAMEENTRYSIZE gives the entry size in 16 bit units, not bytes.
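
As a sketch of what this means for a caller, the code below walks a
hand-built table whose stride is given in 16-bit units. The layout is an
assumption, not something stated in this comment: each entry is taken to
start with one unit holding the group number, followed by the name and a
16-bit NUL. The helper names equal16 and find_group16 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compares two NUL-terminated 16-bit strings for equality. */
static int equal16(const uint16_t *a, const uint16_t *b)
{
    while (*a != 0 && *a == *b) {
        a++;
        b++;
    }
    return *a == *b;
}

/* Hypothetical helper, not part of PCRE.
 * `entry_size` is the stride in 16-bit units (not bytes), as reported for
 * the 16-bit library. Assumed layout per entry: [group number][name][NUL].
 * Returns the group number for `name`, or -1 if it is not in the table. */
int find_group16(const uint16_t *table, int count, int entry_size,
                 const uint16_t *name)
{
    assert(table != NULL && name != NULL && entry_size >= 2);

    for (int i = 0; i < count; i++) {
        /* Pointer arithmetic on uint16_t already steps in 16-bit units. */
        const uint16_t *entry = table + (size_t)i * (size_t)entry_size;
        if (equal16(entry + 1, name))
            return entry[0];
    }
    return -1;
}
```

If the stride is ever needed in bytes, for example when allocating a copy of
the table, it would be entry_size * sizeof(uint16_t).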

> > PCRE_SPTR16 is const short *
>
> Why not using an unsigned short here?


I don't know. Any type will do as long as it is 16 bits wide. It is converted
to pcre_uchar internally, so this type is never used for accessing memory
data. I can change this if you prefer an unsigned type.


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email