[pcre-dev] Locale-aware (Turkish) Unicode caseless matching

Author: Giuseppe D'Angelo
Date:  
To: pcre-dev
Subject: [pcre-dev] Locale-aware (Turkish) Unicode caseless matching
Hello,

I'm doing a bit of research into whether PCRE properly supports
Turkish caseless matching.

The Turkish language features dotted and dotless "i" letters:

* ı (U+0131, LATIN SMALL LETTER DOTLESS I)
* I (U+0049, LATIN CAPITAL LETTER I)
* i (U+0069, LATIN SMALL LETTER I)
* İ (U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE)

In Turkish, the dotless letters (ı, I) form one case pair and the
dotted letters (i, İ) form another. The dotless versions should never
match the dotted versions, and vice versa.

Of course (pun), the Unicode Consortium decided not to introduce new
code points for the Turkish "i" and "I": they are the ordinary Latin
letters, so by default they follow the ordinary Unicode (Latin) casing
rules, and under caseless matching PCRE lets "i" match the capital "I".
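
To make the difference concrete, here is a minimal sketch in C that
contrasts the two folding rules for the four code points listed above.
It does not use PCRE at all; the function names are made up for this
illustration, and the Turkic mappings are the ones marked with status
"T" in Unicode's CaseFolding.txt (I -> ı, İ -> i).

```c
#include <stdint.h>

/* Illustrative helpers only; none of these names exist in PCRE. */

/* Default simple folding, restricted to the four letters above:
   only U+0049 (I) folds, to U+0069 (i). */
uint32_t fold_default(uint32_t cp)
{
    return cp == 0x0049 ? 0x0069 : cp;
}

/* Turkic folding (status "T" in CaseFolding.txt):
   U+0049 (I) -> U+0131 (ı), U+0130 (İ) -> U+0069 (i). */
uint32_t fold_turkic(uint32_t cp)
{
    if (cp == 0x0049) return 0x0131;
    if (cp == 0x0130) return 0x0069;
    return cp;
}

typedef uint32_t (*fold_fn)(uint32_t);

/* Two code points match caselessly if they fold to the same value. */
int caseless_eq(uint32_t a, uint32_t b, fold_fn fold)
{
    return fold(a) == fold(b);
}
```

With fold_default, caseless_eq(0x0069, 0x0049, ...) is true, which is
the behaviour described above; with fold_turkic it is false, while
i/İ and ı/I compare equal.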

I can't find a way to change this behaviour, though. Do you think it
would be feasible to add a pcre_compile / pcre_exec option that sets up
the necessary exceptions? This is more or less what other APIs do; for
instance, Win32's CompareStringEx [1] has a NORM_LINGUISTIC_CASING
option that turns on Turkish-aware casing.
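
In the meantime, one possible workaround is to spell the exceptions
out in the pattern itself. The sketch below is only an illustration:
turkish_caseless_pattern is a hypothetical helper, not part of PCRE,
and it assumes the input is a plain UTF-8 literal with no regex
metacharacters. It wraps each of the four letters in a case-sensitive
group using the inline (?-i:...) syntax, so that the rest of the
pattern can still be compiled caselessly.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a PCRE function). Rewrites a UTF-8 literal
   so that the Turkish i letters only match their own case pair.
   Assumes the literal contains no regex metacharacters.
   Returns a malloc'd string the caller must free, or NULL. */
char *turkish_caseless_pattern(const char *literal)
{
    static const char dotted[]  = "(?-i:[i\xC4\xB0])"; /* i or U+0130 */
    static const char dotless[] = "(?-i:[I\xC4\xB1])"; /* I or U+0131 */
    const size_t group_len = sizeof dotted - 1;        /* same for both */
    const unsigned char *p = (const unsigned char *)literal;

    /* Worst case: every input byte expands to one whole group. */
    char *out = malloc(strlen(literal) * group_len + 1);
    char *q = out;
    if (out == NULL)
        return NULL;

    while (*p != '\0') {
        const char *group = NULL;
        size_t consumed = 1;

        if (*p == 'i') {
            group = dotted;
        } else if (*p == 'I') {
            group = dotless;
        } else if (p[0] == 0xC4 && p[1] == 0xB0) {     /* U+0130 */
            group = dotted;
            consumed = 2;
        } else if (p[0] == 0xC4 && p[1] == 0xB1) {     /* U+0131 */
            group = dotless;
            consumed = 2;
        }

        if (group != NULL) {
            memcpy(q, group, group_len);
            q += group_len;
        } else {
            *q++ = (char)*p;                           /* copy as-is */
        }
        p += consumed;
    }
    *q = '\0';
    return out;
}
```

The result would then be passed to pcre_compile with PCRE_CASELESS |
PCRE_UTF8. This is clumsy and does not cover character classes or
backreferences, which is why a proper option would be preferable.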

Cheers,
--
Giuseppe D'Angelo

[1] http://msdn.microsoft.com/en-us/library/windows/desktop/dd317761%28v=vs.85%29.aspx