[pcre-dev] Locale-aware (Turkish) Unicode caseless matching

Author: Giuseppe D'Angelo
Date:  
To: pcre-dev
Subject: [pcre-dev] Locale-aware (Turkish) Unicode caseless matching
Hello,

I'm doing a bit of research into whether PCRE properly supports Turkish
caseless matching.

The Turkish language features dotted and dotless "i" letters:

* ı (U+0131, LATIN SMALL LETTER DOTLESS I)
* I (U+0049, LATIN CAPITAL LETTER I)
* i (U+0069, LATIN SMALL LETTER I)
* İ (U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE)

The dotless versions (ı, I) should never match the dotted versions
(i, İ), and vice versa.

Of course (pun), the Unicode consortium decided not to introduce new
code points for the Turkish "I" and "i", so they follow the ordinary
Unicode (Latin) casing rules, and with PCRE a caseless "i" matches
the capital "I".
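
To make the difference concrete, here is a minimal sketch in C of the two
folding behaviours for these four letters. It is not PCRE code: fold_i and
caseless_equal are hypothetical helpers, and only ASCII plus the four "i"
letters are handled. I am assuming the default branch should follow the
simple (C/S) entries of Unicode's CaseFolding.txt and the Turkish branch
the T entries; what PCRE does internally may differ from either.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE: fold a single code point for
 * caseless comparison. Only ASCII and the four "i" letters are covered. */
static uint32_t fold_i(uint32_t cp, bool turkish)
{
    if (turkish) {
        if (cp == 0x0049u) return 0x0131u;  /* I -> ı (T entry) */
        if (cp == 0x0130u) return 0x0069u;  /* İ -> i (T entry) */
    }
    /* Default simple folding: ASCII capitals go to lowercase. U+0130 and
     * U+0131 have no simple default folding, so they are left unchanged. */
    if (cp >= 0x0041u && cp <= 0x005Au) return cp + 0x20u;
    return cp;
}

/* Two code points match caselessly when they fold to the same value. */
static bool caseless_equal(uint32_t a, uint32_t b, bool turkish)
{
    return fold_i(a, turkish) == fold_i(b, turkish);
}
```

With turkish set to false, caseless_equal(0x49, 0x69, false) is true, which
is the behaviour I would like to be able to switch off. With turkish set to
true, I pairs only with ı, and İ pairs only with i.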

I can't find a way to change this behaviour, though; do you think it
would be feasible to add a pcre_compile / pcre_exec option switch that
sets up some exceptions? This is more or less what other APIs do; for
instance, Win32's CompareStringEx [1] features a NORM_LINGUISTIC_CASING
option that turns on Turkish-aware casing.

Cheers,
--
Giuseppe D'Angelo

[1] http://msdn.microsoft.com/en-us/library/windows/desktop/dd317761%28v=vs.85%29.aspx