Re: [pcre-dev] PCRE_UTF8 flag

Author: Philip Hazel
Date:  
To: Steven Gerrard
CC: pcre-dev
Subject: Re: [pcre-dev] PCRE_UTF8 flag
On Mon, 19 Oct 2009, Steven Gerrard wrote:

> On your homepage it said that I could contact the active PCRE
> developers via this mail address, hope this is alright.


I'm reading...

> As far as I could understand, the way of using UTF8 strings with PCRE
> is by passing the PCRE_UTF8 option to pcre_compile.


Correct.

> Now, while I understand that passing this option flag to pcre_compile
> causes non-valid UTF8 strings to fail compilation, it seems that I can
> still use UTF8 strings without passing this option to pcre_compile.


Well, yes, it will just treat the string as a sequence of one-byte
characters with values in the range 0-255.
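
To make that concrete, here is a minimal sketch in Python (used purely
for illustration; it is not PCRE, and the sample Hebrew word is made-up
test data) showing what a byte-oriented matcher sees when it is handed
UTF-8 text:

```python
# A sample Hebrew word (hypothetical test data): shin, lamed, vav, final mem.
word = "\u05e9\u05dc\u05d5\u05dd"
encoded = word.encode("utf-8")

char_count = len(word)       # number of Unicode characters
byte_count = len(encoded)    # number of units a byte-oriented matcher sees
byte_values = list(encoded)  # every value lies in the range 0-255

print(char_count, byte_count)
print(byte_values)
```

Each Hebrew letter becomes two separate units here, which is why counts
and character classes behave differently without UTF-8 mode.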

> This makes strings get treated like plain ASCII strings, thereby
> comparing English characters case insensitively, and the rest of the
> chart (for example, Hebrew characters represented by the 128-256 part
> of the chart) using a plain binary comparison.


It will only do case-insensitivity if you set the PCRE_CASELESS flag.

> This, as far as I can see, works for me perfectly - this way I can
> pass both ASCII and UTF8 strings, which will be matched using case
> insensitive collation for English characters, and binary comparison
> for any other character.
>
> Am I missing something?


Yes. Maybe. It depends on the patterns you are using.

Because a UTF-8 character can consist of several bytes, some of its bytes
can be the same as some bytes of a different character, so there is
scope for confusion. If you are using any kind of count in your pattern,
it will be a count of bytes rather than characters. Something like [^A]
in your pattern will match just one byte, not a whole UTF-8 character. It
may be that this does not matter for your patterns, but in general it
won't work. Consider

/A[^B]C/

If the subject string is "A\x{c3}C", that is, 3 Unicode characters, the
second of which has the codepoint 0xc3 and occupies two bytes (0xc3 0x83)
in a UTF-8 string, the pattern match will fail unless you use UTF-8 mode:
[^B] matches only the first of those two bytes, and C is then compared
with the second.
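
The byte-versus-character difference can be reproduced outside PCRE. The
following is a minimal sketch using Python's standard re module as a
stand-in (an assumption for illustration only; it is not PCRE, and the
helper name matches() is hypothetical). A bytes pattern behaves roughly
like PCRE without PCRE_UTF8, and a str pattern roughly like UTF-8 mode.

```python
import re

# Subject "A\x{c3}C": three characters, the middle one is U+00C3.
subject_text = "A\u00c3C"
subject_bytes = subject_text.encode("utf-8")  # b'A\xc3\x83C', four bytes

def matches(pattern, subject):
    """Hypothetical helper: True if pattern matches anywhere in subject."""
    return re.search(pattern, subject) is not None

# Byte-oriented matching (analogous to omitting PCRE_UTF8):
# [^B] consumes only 0xC3, then C is compared with 0x83 and fails.
byte_mode = matches(rb"A[^B]C", subject_bytes)

# Character-oriented matching (analogous to UTF-8 mode):
# [^B] consumes the whole character U+00C3.
char_mode = matches(r"A[^B]C", subject_text)

# Counts behave the same way: .{1} is one byte versus one character.
byte_count = matches(rb"A.{1}C", subject_bytes)
char_count = matches(r"A.{1}C", subject_text)

print(byte_mode, char_mode, byte_count, char_count)
```

Only the character-oriented calls find a match, which mirrors the
failure described above for the pattern without UTF-8 mode.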

If you are searching for *literal* strings, I wouldn't bother using PCRE
at all. :-)

Philip

--
Philip Hazel