Craig Silverstein wrote:
>} May I suggest defaulting to the ISO-8859-1 character set instead? It
>} corresponds to the first 256 Unicode codepoints and is therefore
>} fully compatible with PCRE in UTF-8 mode.
>
>If I understand this right, I don't think that will quite work. It's
>true that latin1 are the first 256 codepoints in unicode, but the
>encoding of chars 128-256 is not the same in latin1 and utf-8.
About ISO 8859-1:
http://en.wikipedia.org/wiki/ISO-8859-1
About how this corresponds to Unicode:
http://en.wikipedia.org/wiki/Latin_Unicode
UTF-8 and ISO-8859-1 are just different ways to represent the same information. UTF-8 can of course represent far more characters than ISO-8859-1, but the first 256 characters (or code points, in Unicode terminology) are the same in both.
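As a rough illustration of that relationship, here is a minimal sketch in C. The helper names latin1_to_utf8 and utf8_to_codepoint are made up for this example and are not part of PCRE; the decoder assumes well-formed input of at most two bytes. It shows that a Latin-1 byte above 127 becomes two bytes in UTF-8, yet decodes back to the same code point.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: encode one ISO-8859-1 byte
   as UTF-8. Returns the number of bytes written to out (1 or 2). */
size_t latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) {
        /* 0..127: the byte is identical in both encodings. */
        out[0] = c;
        return 1;
    }
    /* 128..255: two bytes, 110xxxxx 10xxxxxx. */
    out[0] = (unsigned char)(0xC0 | (c >> 6));
    out[1] = (unsigned char)(0x80 | (c & 0x3F));
    return 2;
}

/* Hypothetical helper: decode a 1- or 2-byte UTF-8 sequence back to
   its code point. Assumes well-formed input; no validation is done. */
unsigned int utf8_to_codepoint(const unsigned char *s)
{
    if (s[0] < 0x80)
        return s[0];
    return ((unsigned int)(s[0] & 0x1F) << 6) | (unsigned int)(s[1] & 0x3F);
}
```

For example, the Latin-1 byte 0xE9 (e with acute accent) is written as 0xC3 0xA9 in UTF-8, and decoding those two bytes yields code point 0xE9 again.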
When in UTF-8 mode, PCRE decodes the UTF-8 input to Unicode code points and looks up the respective properties in chartables.c. When not in UTF-8 mode, the translation takes place indirectly, through the information stored in these tables.
>Anyway, I think the "C" locale are the ISO-8859-1 characters. The
>only difference is how they sort and stuff.
The "C" locale differs among compilers (I know because I once reported a false alarm to Philip because of this).
Also, according to the documentation, the "C" locale limits case conversions to A-Z only in toupper() and tolower(). This is quite a limitation for many European languages.
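A small sketch of that limitation, using only the standard setlocale() and toupper() calls. The wrapper name toupper_in_c_locale is invented for this example. On an implementation that follows the C standard, I would expect an accented Latin-1 letter to come back unchanged, though as noted above, actual behaviour may differ among compilers.

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: upper-case one byte value under the "C" locale.
   Note that this changes the process-wide LC_CTYPE setting. */
int toupper_in_c_locale(int ch)
{
    setlocale(LC_CTYPE, "C");
    /* Cast to unsigned char so values above 127 are passed safely. */
    return toupper((unsigned char)ch);
}
```

Calling toupper_in_c_locale('a') gives 'A', whereas 0xE9 (e with acute accent in ISO-8859-1) is not mapped to 0xC9 when the "C" locale only knows A-Z.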
Ralf