Re: [pcre-dev] [Bug 1049] Add support for UTF-16

Author: Graycode
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] [Bug 1049] Add support for UTF-16
I'm somewhat confused about what this enhancement asks for.
Here are a few points that may need to be clarified.


1. Would different byte order sequences need to be accommodated?
There may be little or big endian, and perhaps always assuming one
might not always be appropriate. Picking up a value with
*(unsigned short int *) could yield the wrong interpretation.

ME: I think PCRE should only do the native endian of the machine
that it's compiled on. x86 happens to be little endian. If any
cross-endian support is needed then it should be left to the
application program to swap the data byte pairs.
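
To illustrate what "swap the data byte pairs" could look like on the
application side, here is a minimal sketch. The helper name
swap_utf16_units is hypothetical and not part of PCRE; it assumes the
data is already held in a buffer of 16-bit units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical application-side helper (not a PCRE function):
   swap the two bytes of every 16-bit code unit in place, which
   converts a buffer between big-endian and little-endian order. */
static void swap_utf16_units(uint16_t *buf, size_t n_units)
{
    assert(buf != NULL || n_units == 0);
    for (size_t i = 0; i < n_units; i++) {
        /* Move the high byte down and the low byte up. */
        buf[i] = (uint16_t)((buf[i] >> 8) | (buf[i] << 8));
    }
}
```

An application would call this once on each block it reads from a
foreign-endian source, before handing the data to the matcher.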


2. Would there be consideration for data that begins with a UTF-16
byte order mark? When present, it indicates the endian of the
data. "\xFF\xFE" (little endian), "\xFE\xFF" (big endian).
Would that be skipped over and ignored, treated as data, or
would that be actually used by the PCRE library to swap its
treatment of the byte order?

ME: I think PCRE should treat a byte order mark as data and not even
try to detect it. Where byte order marks exist, it should be the
application program's responsibility to accommodate them.


3. Would it be acceptable if the search pattern given to PCRE is
always ASCII or UTF-8? Does PCRE need to accommodate UTF-16 in
the search pattern too? The desire is to support UTF-16, but how
well can that be done without UTF-16 in the search pattern?

I think PCRE's use of a UTF-16 (in native endian) search pattern may
be what is being requested. But I'm not sure. There may be some
assumed functionality that needs to be clarified.

Instead of an ASCII pattern such as "\sabc[def]" someone could
manipulate it to be "\x00\s\x00a\x00b\x00c\x00[def]". There, is that
enough "support for UTF-16"? Probably not? If it was then there
could be discussion of pattern alteration schemes without overhauling
the underlying engines, and then perhaps optimization enhancements to
speed searching with such patterns if that was appropriate.
A module concerned with translating a UTF-16 search pattern into
something that can search UTF-16 data using the current PCRE engine
should also accommodate big vs. little endian in the resultant pattern.

Matching of UTF-16 data probably(?) involves the ability to specify
search arguments that contain native UTF-16 code points. Yet for that
to happen PCRE would have to parse a UTF-16 "string" to identify the
regular expression syntax and semantics. The compile() process may
need to be told whether it's a compilation for ASCII/UTF-8 or one
for UTF-16. You might also want it to be told the strlen() of the
pattern or the common UTF-16 single char zero could be misinterpreted.
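
A small sketch of the zero-byte concern, under the assumption that a
UTF-16 pattern is held as an array of 16-bit units ending in a 16-bit
zero. Both helper names are made up for illustration. A byte-oriented
strlen() stops at the first zero byte, which for plain ASCII letters
in UTF-16 is the first or second byte of the pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: count 16-bit code units up to (not including)
   a 16-bit zero terminator. */
static size_t utf16_unit_count(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}

/* What a byte-oriented strlen() sees when given the same buffer:
   it stops at the first zero BYTE, not the first zero code unit. */
static size_t bytes_seen_by_strlen(const uint16_t *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}
```

For the pattern "abc" stored as UTF-16 in native order, the first
function reports 3 units, while strlen() sees at most one byte before
hitting a zero. That is why an explicit length argument would help.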


Let's consider pcregrep as being an application program. With UTF-16
support in the PCRE base library then that program should be able to:

+ Look for a UTF-16 byte order mark at the start of a file.
If none, proceed without UTF-16 involvement.
It might also test for (and perhaps skip over to ignore) the UTF-8
byte order mark "\xEF\xBB\xBF" (3 bytes).

+ If it detects a UTF-16 byte order mark that is different than the
native compile, the program should swap all data byte pairs read
from that file.

+ But what about the pcregrep command-line parameters? Should the
search pattern be specified as UTF-8 and converted to UTF-16 for
possible use in scanning files that have a UTF-16 byte order mark?
Does supporting UTF-16 in pcregrep involve compiling twice - once
treating the pattern as ASCII and a different one having a UTF-16
pattern for the cases where UTF-16 content is detected?
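
The first two steps above might be sketched as follows. This is only
an illustration of the idea, not how pcregrep works; the names
bom_kind, detect_bom, host_is_little_endian and needs_byte_swap are
all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE } bom_kind;

/* Inspect the first bytes of a file buffer for a byte order mark.
   Caveat: "\xFF\xFE\x00\x00" is also the UTF-32 little-endian mark,
   which this sketch does not try to distinguish. */
static bom_kind detect_bom(const unsigned char *p, size_t len)
{
    assert(p != NULL || len == 0);
    if (len >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return BOM_UTF16_LE;
    if (len >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}

/* Returns 1 when the machine running the program is little endian. */
static int host_is_little_endian(void)
{
    const unsigned short probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Returns 1 when the file's UTF-16 byte order differs from the host's,
   i.e. when the application would have to swap every byte pair. */
static int needs_byte_swap(bom_kind k)
{
    if (k == BOM_UTF16_LE) return !host_is_little_endian();
    if (k == BOM_UTF16_BE) return host_is_little_endian();
    return 0; /* no UTF-16 mark: proceed without UTF-16 involvement */
}
```

With no mark, or with a UTF-8 mark, needs_byte_swap() returns 0 and
the file is handled as it is today. The open question about the
command-line pattern is not addressed by this sketch.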


The people asking for UTF-16 support should be more precise about what
is being requested and what their expectation would be. There may be
issues related to PCRE compile() as well as exec(). Or perhaps the
implementation considerations should be more focused on UTF-16 search
patterns rather than methods of traversing UTF-16 data.


References:
UTF-8 http://www.ietf.org/rfc/rfc3629.txt
UTF-16 http://www.ietf.org/rfc/rfc2781.txt


PS - I'm a disinterested party other than the potential impact to the
base PCRE library.

Regards,
Graycode