Re: [pcre-dev] PCRE is not case insensitive with Greek characters

Author: Philip Hazel
Date:
To: Dimitrios
CC: pcre-dev
Subject: Re: [pcre-dev] PCRE is not case insensitive with Greek characters

On Fri, 13 Mar 2009, Dimitrios wrote:

> Obviously you didn't read the discussion thread in the php bug report :)

No, I'm afraid I didn't. I just don't have time (or, I have to admit,
the inclination, especially now I'm retired) to follow up reports for
applications that use PCRE.

> In Greek, there aren't only two "a" characters, there are four:
>
> "?" == "?" == "?" == "?"
> "?" == "?" == "?" == "?"
> etc...
>
> Thus, the default method, at least as is in PHP PCRE, doesn't support
> accented characters.

PCRE uses the Unicode tables for this kind of thing. As far as I can
remember, without checking, there is no facility for an upper case
character to have more than one corresponding lower case character (and
vice versa).

I can't actually see the characters you show above, but you mention
accents. Unicode has an accented Alpha at U+0386 and an accented alpha at
U+03AC, so I assume those are what you are referring to.
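
As a rough illustration of that one-to-one pairing, here is a minimal
sketch. It assumes Python's built-in str.lower(), which follows the
Unicode character database; it does not consult PCRE's own tables, and
the constant and function names are hypothetical.

```python
import unicodedata

# Code points discussed above; the names come from the Unicode database.
ACCENTED_UPPER = "\u0386"  # GREEK CAPITAL LETTER ALPHA WITH TONOS
ACCENTED_LOWER = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS
PLAIN_UPPER = "\u0391"     # GREEK CAPITAL LETTER ALPHA
PLAIN_LOWER = "\u03b1"     # GREEK SMALL LETTER ALPHA


def case_pairs():
    """Pair each upper case alpha with its single lower case mapping."""
    return [(c, c.lower()) for c in (ACCENTED_UPPER, PLAIN_UPPER)]


if __name__ == "__main__":
    for upper, lower in case_pairs():
        print(unicodedata.name(upper), "->", unicodedata.name(lower))
```

Each upper case letter maps to exactly one lower case letter, and the
accented form never maps to the unaccented one.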

My understanding is that, in other languages at least, accented
characters are not considered the same as unaccented for this purpose.
Surely if you were converting a word from upper case to lower case, you
would want to preserve accents? If you had a choice of lower case
characters, which would you use?

It sounds to me as though what you need is a pre-pass function that does
character conversions on the subject string before it is passed to PCRE
(e.g. converts all accented characters to unaccented). I do not think it
makes sense to add code to PCRE itself that is specific to any one
character set (e.g. Greek) because there are just too many different
languages that could all need different special processing. Better keep
that kind of logic outside.
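
A minimal sketch of such a pre-pass, under some loud assumptions: it is
written in Python, it uses Python's own re module as a stand-in for
PCRE, and the helper names strip_accents and accent_insensitive_search
are hypothetical. It decomposes each character (NFD), drops the
combining marks, and recomposes, so it is not specific to Greek.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove combining marks (such as the Greek tonos) from text."""
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def accent_insensitive_search(pattern, subject):
    """Strip accents from pattern and subject, then match caselessly.

    Python's re module stands in for PCRE here (an assumption of this
    sketch); the same pre-pass could run before any regex engine.
    """
    return re.search(strip_accents(pattern), strip_accents(subject),
                     re.IGNORECASE)


if __name__ == "__main__":
    # Accented capital Alpha in the subject, plain small alpha in the pattern.
    print(bool(accent_insensitive_search("\u03b1", "\u0386")))
```

The pattern needs the same treatment as the subject, otherwise an
accented letter in the pattern would never match. Match offsets refer
to the stripped subject, which may not line up with the original
string if it contained separate combining characters.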

Philip

--
Philip Hazel
