[pcre-dev] Locale-aware (Turkish) Unicode caseless matching

Author: Giuseppe D'Angelo
Date:
To: pcre-dev
Subject: [pcre-dev] Locale-aware (Turkish) Unicode caseless matching

Hello,

I'm doing a bit of research into whether PCRE properly supports Turkish
caseless matching.

The Turkish language features dotted and dotless "i" letters:

* ı (U+0131, LATIN SMALL LETTER DOTLESS I)
* I (U+0049, LATIN CAPITAL LETTER I)
* i (U+0069, LATIN SMALL LETTER I)
* İ (U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE)

The dotless versions should never match the dotted versions, and
vice versa.

Of course (pun), the Unicode consortium decided not to introduce new
code points for representing the Turkish "i", so the letter normally
follows the ordinary Unicode (Latin) rules for casing, and with PCRE
"i" matches the capital "I".
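
To make the default behaviour concrete, here is a minimal sketch using
Python's built-in string methods, which implement the locale-independent
Unicode case mappings. It only illustrates those default mappings and says
nothing about PCRE internals; the helper name default_caseless_equal is
made up for this example.

```python
# Default (locale-independent) Unicode casing, as exposed by Python's str.
DOTLESS_SMALL = "\u0131"   # ı  LATIN SMALL LETTER DOTLESS I
DOTTED_CAPITAL = "\u0130"  # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE


def default_caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using the default Unicode case folding."""
    return a.casefold() == b.casefold()


# "I" and "i" are treated as a case pair, which is wrong for Turkish text.
print(default_caseless_equal("I", "i"))      # True
# The dotless small letter uppercases to the plain capital "I".
print(DOTLESS_SMALL.upper() == "I")          # True
# The dotted capital lowercases to "i" followed by U+0307 COMBINING DOT ABOVE.
print(DOTTED_CAPITAL.lower() == "i\u0307")   # True
```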

I can't find a way to change this behaviour, though; do you think it
would be feasible to add a pcre_compile / pcre_exec option that sets
up some exceptions? This is more or less what other APIs do; for
instance, Win32's CompareStringEx [1] features a NORM_LINGUISTIC_CASING
option that turns on Turkish-aware casing.
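
As a rough sketch of what such an option could mean, the Python snippet
below applies two Turkish exceptions (I -> ı and İ -> i) before the
default case folding. The function names and the turkish flag are
hypothetical, chosen only for illustration; this is neither a PCRE API
nor the way CompareStringEx is implemented.

```python
# Hypothetical Turkish exceptions, applied before the default folding.
TURKISH_FOLD = {
    "I": "\u0131",   # I -> ı (capital I folds to dotless small i)
    "\u0130": "i",   # İ -> i (dotted capital I folds to dotted small i)
}


def fold(text: str, turkish: bool = False) -> str:
    """Case-fold text, optionally with the Turkish exceptions enabled."""
    if turkish:
        text = "".join(TURKISH_FOLD.get(ch, ch) for ch in text)
    return text.casefold()


def caseless_equal(a: str, b: str, turkish: bool = False) -> bool:
    """Caseless comparison with an opt-in switch for Turkish casing."""
    return fold(a, turkish) == fold(b, turkish)


print(caseless_equal("I", "i"))                     # True  (default rules)
print(caseless_equal("I", "i", turkish=True))       # False (dotless vs dotted)
print(caseless_equal("I", "\u0131", turkish=True))  # True  (I matches ı)
print(caseless_equal("\u0130", "i", turkish=True))  # True  (İ matches i)
```

With the flag off, the comparison behaves like ordinary Unicode caseless
matching; with it on, the dotted and dotless letters never match each
other.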

Cheers,
--
Giuseppe D'Angelo

[1] http://msdn.microsoft.com/en-us/library/windows/desktop/dd317761%28v=vs.85%29.aspx
