Re: [pcre-dev] Here is pcre-7.1-RC1 for you to play with

Author: Ralf Junker
Date:
To: Craig Silverstein
CC: pcre-dev
Subject: Re: [pcre-dev] Here is pcre-7.1-RC1 for you to play with
Craig Silverstein wrote:

>} May I suggest defaulting to the ISO-8859-1 character set instead? It
>} corresponds to the first 256 Unicode codepoints and is therefore
>} fully compatible with PCRE in UTF-8 mode.
>
>If I understand this right, I don't think that will quite work. It's
>true that latin1 are the first 256 codepoints in unicode, but the
>encoding of chars 128-256 is not the same in latin1 and utf-8.


About ISO 8859-1:

http://en.wikipedia.org/wiki/ISO-8859-1

About how this corresponds to Unicode:

http://en.wikipedia.org/wiki/Latin_Unicode

UTF-8 and ISO-8859-1 are just different ways to represent the same information. UTF-8 can of course represent far more characters than ISO-8859-1, but the first 256 characters (or code points, in Unicode terminology) are the same.
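
To make both points concrete, here is a small sketch in Python (chosen only for brevity, it has nothing to do with PCRE itself). It checks that every ISO-8859-1 byte value equals the Unicode code point it stands for, and shows that the byte sequences still differ above 127, which is what Craig describes:

```python
# Every ISO-8859-1 byte value equals the Unicode code point it represents.
latin1_bytes = bytes(range(256))
code_points = [ord(ch) for ch in latin1_bytes.decode("latin-1")]
same_code_points = code_points == list(range(256))

# The encoded bytes agree only for 0-127; above that UTF-8 uses two bytes.
e_acute = "\u00e9"                       # LATIN SMALL LETTER E WITH ACUTE
latin1_form = e_acute.encode("latin-1")  # one byte: 0xE9
utf8_form = e_acute.encode("utf-8")      # two bytes: 0xC3 0xA9
```

So the code points are identical, and only the encoding on the wire differs.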

When in UTF-8 mode, PCRE decodes UTF-8 to Unicode code points and looks up the respective property in chartables.c. When not in UTF-8 mode, the translation happens indirectly, through the information stored in these tables.
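
A rough sketch of that idea, again in Python. The table IS_LETTER and both helper functions are hypothetical and do not reflect the real layout of chartables.c; the sketch only assumes a 256-entry table built for ISO-8859-1:

```python
# Hypothetical 256-entry property table: True where the code point is a letter.
# NOT the real chartables.c layout, only an illustration.
IS_LETTER = [chr(cp).isalpha() for cp in range(256)]

def is_letter_8bit(subject: bytes, pos: int) -> bool:
    # Non-UTF-8 mode: the byte value indexes the table directly.
    return IS_LETTER[subject[pos]]

def is_letter_utf8(subject: bytes, pos: int) -> bool:
    # UTF-8 mode: decode to a code point first, then index the same table.
    lead = subject[pos]
    if lead < 0x80:
        cp = lead
    elif 0xC2 <= lead <= 0xC3:
        # Two-byte sequences with these lead bytes cover U+0080..U+00FF.
        cp = ((lead & 0x1F) << 6) | (subject[pos + 1] & 0x3F)
    else:
        raise ValueError("code point outside the 256-entry table")
    return IS_LETTER[cp]
```

With a table built for ISO-8859-1, both lookups give the same answer for the same character, whichever way it was encoded.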

>Anyway, I think the "C" locale are the ISO-8859-1 characters. The
>only difference is how they sort and stuff.


The "C" locale differs among compilers (I know because I once reported a false alarm to Philip because of this).

Also, according to the documentation, the "C" locale limits case conversions to A-Z only in toupper() and tolower(). This is quite a limitation for many European languages.
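
As an analogy only (this is Python, not the C library): Python's bytes.upper() converts ASCII letters only, much like toupper() in the "C" locale, while str.upper() knows the full Unicode case mappings:

```python
word = "caf\u00e9"  # "café"

# ASCII-only conversion, comparable to toupper() in the "C" locale:
ascii_only = word.encode("latin-1").upper().decode("latin-1")

# Conversion that also covers the upper half of ISO-8859-1:
full = word.upper()
```

Here ascii_only keeps the lowercase é, while full converts it to É.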

Ralf