[pcre-dev] [Bug 1208] Case folding in PCRE

Author: Philip Hazel
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 1208] Case folding in PCRE

http://bugs.exim.org/show_bug.cgi?id=1208

--- Comment #4 from Philip Hazel <ph10@???> 2012-02-09 16:58:33 ---
I agree with Zoltan that this is a difficult area, and I have kept well clear
of it in the past. Consider this: you say

For instance, "ß" (U+00DF LATIN SMALL LETTER SHARP S) should match "ss"

I take it that you mean a U+00DF in the pattern should match "ss" in the
subject string. But should "ss" in the pattern also match U+00DF in the
subject string? And what about a single accented character versus an
unaccented character followed by a combining accent? This seems to be adding
quite a lot of "semantics" to the business of character string matching.
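
To make the two questions concrete, here is a small Python sketch. It is only
an illustration of what Unicode full case folding and canonical normalization
do to these strings, not a proposal for how PCRE should behave. The helper
names caseless_equal and canonical_caseless_equal are made up for this
example; str.casefold and unicodedata.normalize are the standard library calls.

```python
import unicodedata


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using Unicode full case folding."""
    return a.casefold() == b.casefold()


def canonical_caseless_equal(a: str, b: str) -> bool:
    """Compare after canonical decomposition (NFD) and full case folding."""
    def fold(s: str) -> str:
        nfd = unicodedata.normalize("NFD", s)
        return unicodedata.normalize("NFD", nfd.casefold())
    return fold(a) == fold(b)


sharp_s = "\u00df"           # LATIN SMALL LETTER SHARP S

# Full case folding is symmetric: both directions compare equal.
print(caseless_equal(sharp_s, "ss"))    # True
print(caseless_equal("ss", sharp_s))    # True

# The folded form is longer than the original, so this is not one-to-one.
print(len(sharp_s), len(sharp_s.casefold()))    # 1 2

precomposed = "\u00e9"       # e with acute, one code point
decomposed = "e\u0301"       # e followed by COMBINING ACUTE ACCENT

# Code point comparison treats these as different strings.
print(precomposed == decomposed)                           # False
# They only compare equal once both sides are normalized.
print(canonical_caseless_equal(precomposed, decomposed))   # True
```

Both comparisons change the length of the text being compared, which is
exactly the extra "semantics" in question.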

As Zoltan says, PCRE uses relatively compact tables, and it implements only
one-to-one case mappings. Changing this takes us to a whole new ballpark. I
guess it would involve a second table of some kind, and it would play havoc
with certain operations such as lookbehind, which rely on knowing, at compile
time, how many characters to go back. Having to check another table for each
character (even when not case folding) would no doubt slow things down
somewhat. Perhaps the solution might be to check the table at compile time and
compile a different opcode for characters that require additional testing,
though I am not really happy about this and I'm not sure how it would work for
character pairs such as "ss". The feature could, of course, be optional, which
would save everybody who is matching English from wasting time checking for
the German Eszett.
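
The compile-time idea can be sketched as a toy literal matcher in Python. All
of this is hypothetical: the MULTI_FOLD table, the opcode names CHAR, CHARI
and CHAR_MULTI, and the functions below are invented for illustration and do
not correspond to PCRE's actual tables or opcodes. The second table is
consulted only while compiling, so ordinary characters pay no extra cost at
match time.

```python
# Hypothetical second table: characters whose full case folding is longer
# than one character. Only a single entry is shown.
MULTI_FOLD = {"\u00df": "ss"}


def simple_fold(ch: str) -> str:
    """One-to-one case mapping; multi-character results are ignored."""
    low = ch.lower()
    return low if len(low) == 1 else ch


def compile_literal(pattern: str, caseless: bool):
    """Turn a literal pattern into a list of (opcode, character) pairs."""
    ops = []
    for ch in pattern:
        if not caseless:
            ops.append(("CHAR", ch))
        elif ch in MULTI_FOLD:
            # Looked up once here, not for every subject character.
            ops.append(("CHAR_MULTI", ch))
        else:
            ops.append(("CHARI", simple_fold(ch)))
    return ops


def match_at(ops, subject: str, pos: int = 0):
    """Return the end offset of a match starting at pos, or None."""
    for op, ch in ops:
        if op == "CHAR":
            if subject[pos:pos + 1] != ch:
                return None
            pos += 1
        elif op == "CHARI":
            if pos >= len(subject) or simple_fold(subject[pos]) != ch:
                return None
            pos += 1
        else:  # CHAR_MULTI: accept the character itself or its expansion
            expansion = MULTI_FOLD[ch]
            window = subject[pos:pos + len(expansion)]
            if subject[pos:pos + 1] == ch:
                pos += 1
            elif "".join(simple_fold(c) for c in window) == expansion:
                pos += len(expansion)
            else:
                return None
    return pos


ops = compile_literal("stra\u00dfe", caseless=True)
print(match_at(ops, "stra\u00dfe"))   # 6
print(match_at(ops, "STRASSE"))       # 7

# The reverse direction is not handled: "ss" in the pattern compiles to two
# ordinary CHARI opcodes and does not match U+00DF in the subject.
print(match_at(compile_literal("strasse", caseless=True), "stra\u00dfe"))  # None
```

The same six-character pattern consumes either six or seven subject
characters, which is what breaks a lookbehind that must know at compile time
how far to step back. The sketch also leaves the "ss"-in-the-pattern case
unsolved.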

I don't know how Perl handles this. I guess one of us should check.
