[pcre-dev] [Bug 1208] Case folding in PCRE

Author: Giuseppe D'Angelo
To: pcre-dev
Subject: [pcre-dev] [Bug 1208] Case folding in PCRE

http://bugs.exim.org/show_bug.cgi?id=1208




--- Comment #2 from Giuseppe D'Angelo <dangelog@???> 2012-02-08 23:18:23 ---
(In reply to comment #1)
> I read about these issues in a Unicode forum before.
>
> For example ß can be a Greek letter and it does not match ss in that case. Or
> a mathematical symbol which has no othercase at all... There are some
> complicated cases in the Arabic languages as well. As far as I know there is
> no nice way to handle these cases, too many exceptions, and exceptions of
> exceptions...


No, it can't: the Eszett (U+00DF) is not the Greek lowercase beta (U+03B2),
which is again different from the Beta symbol (U+03D0) and from the various
mathematical betas (U+1D6FD and co.).

The folding specified for the Eszett and the betas is different (in particular,
the mathematical ones have no folding at all):
http://www.unicode.org/Public/6.1.0/ucd/CaseFolding.txt

00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
03D0; C; 03B2; # GREEK BETA SYMBOL
0392; C; 03B2; # GREEK CAPITAL LETTER BETA

In other words: ß folds to "ss", while both the beta symbol and the capital
beta fold to the lowercase beta β. So even after folding, ß and the betas do
not match each other.
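
As a quick cross-check of the table above, here is a minimal sketch in Python,
whose str.casefold() applies Unicode case folding. The dictionary keys and
variable names are just illustrative, and the exact results depend on the
Unicode version bundled with the interpreter:

```python
import unicodedata

# Code points discussed above (labels are only for readability).
samples = {
    "eszett": "\u00DF",
    "greek small beta": "\u03B2",
    "greek beta symbol": "\u03D0",
    "greek capital beta": "\u0392",
    "math italic small beta": "\U0001D6FD",
}

# str.casefold() applies the C + F mappings from CaseFolding.txt.
folded = {label: ch.casefold() for label, ch in samples.items()}

for label, ch in samples.items():
    target = " ".join("%04X" % ord(c) for c in folded[label])
    print("%04X; %s; %s" % (ord(ch), target, unicodedata.name(ch)))
```

The Eszett folds to two code points, the Greek betas all end up on U+03B2, and
the mathematical beta is left untouched.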

The actual problem is that, for instance, case-insensitive matching currently
behaves like this:

- "μ" (lowercase mu) matches "Μ" (uppercase mu);
- "Μ" (uppercase mu) matches "μ" (lowercase mu);

- "µ" (micro) matches "Μ" (uppercase mu);
- "Μ" (uppercase mu) DOES NOT match "µ" (micro);

- "µ" (micro) DOES NOT match "μ" (lowercase mu).

That's not expected at all: the resulting relation is neither symmetric (the
micro sign matches uppercase mu, but not vice versa) nor transitive.
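
To make the difference concrete, here is a rough Python sketch that compares
two strategies on the three characters above. The othercase() helper is a
hypothetical, simplified model of a "one other case per character" table; it is
NOT PCRE's actual code, only an assumption about how such a table can produce
the asymmetry. The second strategy compares case-folded forms instead:

```python
MICRO, SMALL_MU, CAPITAL_MU = "\u00B5", "\u03BC", "\u039C"

def othercase(ch):
    """Hypothetical single 'other case' of ch (simplified model, not PCRE code)."""
    upper, lower = ch.upper(), ch.lower()
    if upper != ch and len(upper) == 1:
        return upper
    if lower != ch and len(lower) == 1:
        return lower
    return ch

def othercase_match(pattern_ch, subject_ch):
    # The pattern character matches itself or its single other case.
    return subject_ch in (pattern_ch, othercase(pattern_ch))

def casefold_match(a, b):
    # Compare the case-folded forms; this is an equivalence relation.
    return a.casefold() == b.casefold()

pairs = [
    (SMALL_MU, CAPITAL_MU),
    (CAPITAL_MU, SMALL_MU),
    (MICRO, CAPITAL_MU),
    (CAPITAL_MU, MICRO),
    (MICRO, SMALL_MU),
]
for pattern_ch, subject_ch in pairs:
    print("U+%04X vs U+%04X: othercase=%s casefold=%s" % (
        ord(pattern_ch), ord(subject_ch),
        othercase_match(pattern_ch, subject_ch),
        casefold_match(pattern_ch, subject_ch)))
```

Under this model the one-table approach reproduces the five results listed
above, whereas comparing folded forms treats all three characters as
equivalent, since U+00B5 and U+039C both fold to U+03BC.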

> At the moment we simply use the Unicode othercase / type tables. They have an
> acceptable library size increase and define the common behaviour.


Fair enough, but that's why I was asking if you were planning to do full
casefolding...

Cheers,
Giuseppe D'Angelo

