Author: Ralf Junker Date: To: pcre-dev Subject: Re: [pcre-dev] Here is pcre-7.1-RC1 for you to play with
Philip Hazel wrote:
>I don't like that idea because it is just one very specific case. What
>about the people who would prefer ISO-8859-2, etc?
Yes, there are many different character sets, but it is not up to the regex library to support them all. There are solutions available for this job, for example libiconv, which supports over 130 different charsets and encodings. People should use this to convert existing (or even upcoming) character sets to UTF-8 and feed the result to PCRE in UTF-8 mode.
Unfortunately, ISO-8859-2 does not help when PCRE is used in UTF-8 mode. Only ISO-8859-1 is an exact mapping of the first 256 Unicode code points (U+0000 to U+00FF).
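To illustrate the point, here is a minimal sketch in Python. It uses Python's built-in codecs as a stand-in for libiconv (an assumption made purely for illustration; the helper name to_utf8 is hypothetical). It checks that ISO-8859-1 byte values coincide with Unicode code points, that ISO-8859-2 byte values do not, and that either encoding can be converted to UTF-8 before matching.

```python
# Decode every possible byte value under both encodings and compare
# the resulting Unicode code points with the raw byte values.
all_bytes = bytes(range(256))

latin1_points = [ord(ch) for ch in all_bytes.decode("iso-8859-1")]
latin2_points = [ord(ch) for ch in all_bytes.decode("iso-8859-2")]

# ISO-8859-1: byte value N is always code point U+00NN.
latin1_is_identity = latin1_points == list(range(256))

# ISO-8859-2: byte values whose code point differs from the byte value.
latin2_mismatches = [b for b in range(256) if latin2_points[b] != b]


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Hypothetical helper: re-encode `raw` from `source_encoding` to UTF-8.

    This plays the role that libiconv would play in a C program.
    """
    return raw.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    print("ISO-8859-1 is identity on 0x00..0xFF:", latin1_is_identity)
    print("ISO-8859-2 bytes that differ:", len(latin2_mismatches))
    # Byte 0xB1 is 'a with ogonek' (U+0105) in ISO-8859-2.
    print(to_utf8(b"\xb1", "iso-8859-2"))
```

The converted bytes are what would then be passed to the regex library in its UTF-8 mode.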
>It has occurred to me that we could distribute several tables in
>pcre_chartables.c and make them configurable, so you could say
>
> ./configure --default-tables=ISO-8859-1
>
>for example.
You could of course do that, but I believe that you should be prepared to answer questions like: Why don't you support encoding Y? Will you add charset X?
>The default default would remain ASCII.
>
>> For non-European languages, it would also be nice if a future version
>> PCRE would have a callback function to convert case for codepoints
>> greater than 255.
>
>Isn't that what Unicode is all about? With Unicode property support,
>PCRE does this already in UTF-8 mode.
You are correct in saying that PCRE supports Unicode when in UTF-8 mode. But then chartables.c should also support Unicode. ASCII is a very small subset of Unicode, so there is nothing wrong with using that. But it is wasted space, since the upper 128 characters are not used by ASCII. With ASCII, caseless matching is not possible for German umlauts and French accented characters. Only ISO-8859-1 makes this possible.
A little side note: ISO-8859-1 only adds to ASCII, it does not modify or take away anything from the lower 128 characters!
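The difference can be sketched with two 256-entry lower-casing tables, one built for ASCII and one for ISO-8859-1. This is only an illustration in Python, not PCRE's own table generator; the names build_lower_table and caseless_equal are hypothetical.

```python
def build_lower_table(encoding: str) -> bytes:
    """Build a 256-entry table mapping each byte to its lower-case byte.

    Bytes that are undefined in `encoding`, or whose lower-case form
    cannot be expressed as a single byte, map to themselves.
    """
    table = bytearray(range(256))
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(encoding)
            lowered = ch.lower().encode(encoding)
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue  # leave this entry unchanged
        if len(lowered) == 1:
            table[byte] = lowered[0]
    return bytes(table)


ASCII_LOWER = build_lower_table("ascii")
LATIN1_LOWER = build_lower_table("iso-8859-1")


def caseless_equal(a: bytes, b: bytes, table: bytes) -> bool:
    """Compare two byte strings after folding both through `table`."""
    return a.translate(table) == b.translate(table)


if __name__ == "__main__":
    upper_a_umlaut, lower_a_umlaut = b"\xc4", b"\xe4"  # ISO-8859-1 bytes
    print(caseless_equal(upper_a_umlaut, lower_a_umlaut, ASCII_LOWER))
    print(caseless_equal(upper_a_umlaut, lower_a_umlaut, LATIN1_LOWER))
    # The lower 128 entries are the same in both tables.
    print(ASCII_LOWER[:128] == LATIN1_LOWER[:128])
```

With the ASCII table the upper 128 entries are all identity mappings, which is the unused space mentioned above; the ISO-8859-1 table fills some of them in and leaves the lower 128 entries untouched.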
I think the chartables.c question is a political one. It depends on whether you put the PCRE focus on UTF-8 mode or not: choose ASCII to point out PCRE's non-UTF-8 mode, but go for ISO-8859-1 to stress PCRE's Unicode capabilities, if you think that UTF-8 will become the de facto Unicode encoding of the future.