[pcre-dev] [Bug 1049] Add support for UTF-16

Author: Zoltan Herczeg
Date:
To: pcre-dev
Old-Topics: [pcre-dev] [Bug 1049] New: Add support for UTF-16
Subject: [pcre-dev] [Bug 1049] Add support for UTF-16

------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1049

--- Comment #9 from Zoltan Herczeg <hzmester@???> 2011-11-12 08:05:30 ---
I moved Graycode's comment to the bug since he makes valid arguments.

> I'm somewhat confused about what this enhancement asks for.
> Here's a few points that may need to be clarified.

> 1. Would different byte order sequences need to be accommodated?
> There may be little or big endian, and perhaps always assuming one
> might not always be appropriate. Picking up a value with
> *(unsigned short int) could yield the wrong interpretation.
>
> ME: I think PCRE should only do the native endian of the machine
> that it's compiled on. x86 happens to be little endian. If any
> cross-endian support is needed then it should be left to the
> application program to swap the data byte pairs.

I was thinking about this before, and we should follow the typical PCRE approach
here. For performance reasons, PCRE should not cover all possible use cases,
only the most common ones. A typical application converts UTF-16 input to host
byte order and removes the BOM first, since it cannot process the text
efficiently otherwise. Thus PCRE should not worry about the BOM or byte order
either, but we should add utility functions to help simple applications:

// The following function converts any UTF-16 input to host byte order.
// It removes all BOMs, and swaps the byte order if the next character is not
// in host byte order according to the last seen BOM.
// The default is host byte order.
// Since conversion happens left to right, dst can be the same as src (or lower
// than src); otherwise it must point to a different buffer.
// length is measured in characters, not in bytes.

int pcre_utf16_convert_to_host_byte_order(wchar_t *dst, wchar_t *src, int
length);
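
To make the intended behaviour concrete, here is a minimal sketch of such a
helper. It is not part of PCRE. The name utf16_to_host_order_sketch, the use of
uint16_t instead of wchar_t (which is wider than 16 bits on many platforms),
and the return value (the number of code units written to dst) are all
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not a PCRE function.
 * Copies `length` 16-bit code units from src to dst, dropping every BOM and
 * byte-swapping the units that follow a byte-swapped BOM.
 * Returns the number of code units written (an assumed convention). */
static int utf16_to_host_order_sketch(uint16_t *dst, const uint16_t *src,
                                      int length)
{
    int swap = 0;   /* default: input is already in host byte order */
    int out = 0;
    int i;

    assert(dst != NULL && src != NULL && length >= 0);

    for (i = 0; i < length; i++) {
        uint16_t c = src[i];

        if (c == 0xFEFF) {          /* BOM read in host order */
            swap = 0;
            continue;
        }
        if (c == 0xFFFE) {          /* BOM read byte-swapped */
            swap = 1;
            continue;
        }
        if (swap)
            c = (uint16_t)((c << 8) | (c >> 8));

        /* out <= i always holds, so dst == src (in place) is safe. */
        dst[out++] = c;
    }
    return out;
}
```

Because the write index never overtakes the read index, the same buffer can be
passed as both dst and src, which matches the left-to-right note above.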

> 2. Would there be consideration for data that begins with a UTF-16
> byte order mark? When present, it indicates the endian of the
> data. "\xFF\xFE" (little endian), "\xFE\xFF" (big endian).
> Would that be skipped over and ignored, treated as data, or
> would that be actually used by the PCRE library to swap its
> treatment of the byte order?

> ME: I think PCRE should treat a byte order mark as data and not even
> try to detect it. Where byte order marks exist, it should be the
> application program's responsibility to accommodate them.

See my opinion above.

> 3. Would it be acceptable if the search pattern given to PCRE is
> always be ASCII or UTF-8? Does PCRE need to accommodate UTF-16 in
> the search pattern too? The desire is to support UTF-16, but how
> well can that be done without UTF-16 in the search pattern?

> I think PCRE's use of a UTF-16 (in native endian) search pattern may
> be what is being requested. But I'm not sure. There may be some
> assumed functionality that needs to be clarified.

The current approach sets the mode flags at compile time, thus a compiled UTF-8
pattern cannot be used for ASCII matching later. I think we should keep this
approach.

As a low-priority task, we could implement conversion helpers between
different formats.

> Instead of an ASCII pattern such as "\sabc[def]" someone could
> manipulate it to be "\x00\s\x00a\x00b\x00c\x00[def]". There, is that
> enough "support for UTF-16"? Probably not? If it was then there
> could be discussion of pattern alteration schemes without overhauling
> the underlying engines, and then perhaps optimization enhancements to
> speed searching with such patterns if that was appropriate.
> A module concerned with translating a UTF-16 search pattern into
> something that can search UTF-16 data using the current PCRE engine
> should also accommodate big vs. small endian in the resultant pattern.

Yeah, it is not enough, since it would only work for caseless fixed strings.

> Matching of UTF-16 data probably(?) involves the ability to specify
> search arguments that contain native UTF-16 code points. Yet for that
> to happen PCRE would have to parse a UTF-16 "string" to identify the
> regular expression syntax and semantics. The compile() process may
> need to be told whether it's a compilation for ASCII/UTF-8 or one
> for UTF-16. You might also want it to be told the strlen() of the
> pattern or the common UTF-16 single char zero could be misinterpreted.

Philip prefers a different approach and I support his idea. We should extend
the API with individual compile/study/exec/utility functions for each input
type. The question is how to do it: an individual library for each mode, or some
creative use of the C preprocessor (I would prefer the latter, but I am not
against the former).
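
As a rough illustration of the preprocessor route, the sketch below stamps out
the same function body once per code-unit width. Everything here is
hypothetical: the macro DEFINE_UNIT_STRLEN and the demo8_/demo16_ prefixes are
invented for this example and say nothing about how PCRE itself would name or
organise things. A length function is used because it also shows why a 16-bit
pattern must be terminated by a zero code unit rather than a zero byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical template: one body, instantiated for each code-unit type. */
#define DEFINE_UNIT_STRLEN(prefix, unit_t)              \
    static size_t prefix##_strlen(const unit_t *s)      \
    {                                                   \
        const unit_t *p = s;                            \
        assert(s != NULL);                              \
        while (*p != 0)                                 \
            p++;                                        \
        return (size_t)(p - s);                         \
    }

/* 8-bit variant: counts bytes up to the first zero byte. */
DEFINE_UNIT_STRLEN(demo8, uint8_t)

/* 16-bit variant: counts code units up to the first zero code unit, so a
 * unit such as 0x0100 (which contains a zero byte) does not end the string. */
DEFINE_UNIT_STRLEN(demo16, uint16_t)
```

The same idea would extend to compile/study/exec: the shared source is written
against a code-unit type, and each build or macro expansion picks the width.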

> Let's consider pcregrep as being an application program. With UTF-16
> support in the PCRE base library then that program should be able to:
>
> + Look for a UTF-16 byte order mark at the start of a file.
> If none, proceed without UTF-16 involvement.
> It might also test for (and perhaps skip over to ignore) the UTF-8
> byte order mark "\xEF\xBB\xBF" (3 bytes).

I like this idea, but we need to be careful since grep should work on binary
data as well, so perhaps we should add force flags.
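
A minimal sketch of the detection step, assuming the application has the first
few bytes of the file in a buffer. The enum and the function name
detect_bom_sketch are hypothetical and not part of pcregrep. A force flag
would simply bypass this check and use the encoding the user asked for.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
enum bom_kind_sketch { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE };

/* Inspects the first bytes of a buffer and reports which BOM, if any,
 * is present. It only looks at the bytes; it does not validate the data
 * that follows, so binary files can still produce a false positive. */
static enum bom_kind_sketch detect_bom_sketch(const unsigned char *buf,
                                              size_t len)
{
    assert(buf != NULL || len == 0);

    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```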

> + If it detects a UTF-16 byte order mark that is different than the
> native compile, the program should swap all data byte pairs read
> from that file.

With the utility above. Does C have a wchar version of common utilities like
gets?

> + But what about the pcregrep command-line parameters? Should the
> search pattern be specified as UTF-8 and converted to UTF-16 for
> possible use in scanning files that have a UTF-16 byte order mark?
> Does supporting UTF-16 in pcregrep involve compiling twice - once
> treating the pattern as ASCII and a different one having a UTF-16
> pattern for the cases where UTF-16 content is detected?

It shouldn't.

> The people asking for UTF-16 support should be more precise about what
> is being requested and what their expectation would be. There may be
> issues related to PCRE compile() as well as exec(). Or perhaps the
> implementation considerations should be more focused on UTF-16 search
> patterns rather than methods of traversing UTF-16 data.
>
> References:
> UTF-8 http://www.ietf.org/rfc/rfc3629.txt
> UTF-16 http://www.ietf.org/rfc/rfc2781.txt
>
> PS - I'm a dis-interested party other than the potential impact to the
> base PCRE library.

Your feedback is very welcome. We could learn a lot from it.

> Regards,
> Graycode

I think PCRE is the best library out there, and if we could add this feature
somehow, it would be perfect.

Zoltan

--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email
