[pcre-dev] [Bug 1208] Case folding in PCRE

Author: Giuseppe D'Angelo
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 1208] Case folding in PCRE

------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1208

Giuseppe D'Angelo <dangelog@???> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dangelog@???

--- Comment #11 from Giuseppe D'Angelo <dangelog@???> 2012-02-10 21:15:46 ---
(In reply to comment #3)
> To tell you the truth this issue has not been raised since I am here, so I
> never estimated the required efforts. Is there any example implementation for
> it (for UTF caseless compare)?

The general rules of caseless string comparison are available here in section
3.13 (see f.i. Default Caseless Matching):
http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
In short, it's a matter of casefolding the two strings, plus eventually rounds
of normalizations.

(In reply to comment #4)
> I agree with Zoltan that this is a difficult area, and I have kept well clear
> of it in the past. Consider this: you say
>
> For instance, "ß" (U+00DF LATIN SMALL LETTER SHARP S) should match "ss"
>
> I take it that you mean a U+00DF in the pattern should match "ss" in the
> subject string. But, should "ss" in the pattern match U+00DF in the subject
> string also?

For the argument's sake, let's put special regexp symbols out of the equation:
if ß matches "SS" case insensitively, then "SS" should match ß as well. In
other words, it's logical to expect that both
"straße" =~ /STRASSE/i
and
"STRASSE" =~ /straße/i
match.

> And what about a single accented character versus an unaccented character
> followed by an accent? This seems to be adding quite a lot of "semantics" to
> the business of character string matching.

Indeed. That's point 2.1 here
http://www.unicode.org/reports/tr18/#Extended_Unicode_Support

> As Zoltan says, PCRE uses relatively compact tables, and it implements only
> one-to-one case mappings. Changing this takes us to a whole new ballpark. I
> guess it would involve a second table of some kind, and it would play havoc
> with certain operations such as lookbehind, which rely on knowing, at compile
> time, how many characters to go back. Having to check another table for each
> character (even when not case folding) would no doubt slow things down
> somewhat.

I'm not sure it requires a lookup for each character. I'm no Unicode expert,
but it seems to me that it's a kind of preprocessing to do on both the pattern
and subject string in order to be able to do binary comparisons then.

> Perhaps the solution might be to check the table at compile time and
> compile a different opcode for characters that require additional testing,
> though I am not really happy about this and I'm not sure how it would work for
> character pairs such as "ss". The feature could, of course, be optional, which
> would save everybody who is matching English wasting time checking for the
> German Eszett.

Indeed.

> I don't know how Perl handles this. I guess one of us should check.

It doesn't according to the perlunicode page (see RL2.1)

http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level

(In reply to comment #5)
> We already have such opcode: \R which matches to (?:\r|\n|\r\n) although it
> cannot be used inside [] ranges. If the number of such letters are relatively
> low (<5 for example), we might able to introduce special opcodes for them (Is
> there any other besides Eszett?).

Basically the ones with "F" status listed here:
http://www.unicode.org/Public/6.1.0/ucd/CaseFolding.txt

> The other opcodes would be implemented as
> character ranges. Similar to \h and \v but the list on the list is long and
> would require manual maintenance... Anyway if we decide to add multiple cases
> thing \h and \v should become part of that new system.
>
> By the way, this would surely be unexpected for an average PCRE user:
> - "Μ" (uppercase mu) matches "μ" (lowercase mu);
>
> Or uppercase mu has a different codepoint like different beta characters?

It has! That was Μ (U+039C, GREEK CAPITAL LETTER MU), and the fact it matches
is no surprise. What I was pointing out was the inconsistency when doing case
insensitive matching with PCRE 8.30:

"µ" (U+00B5, MICRO SIGN) matches "Μ" (U+039C, GREEK CAPITAL LETTER MU)
but
"Μ" (U+039C, GREEK CAPITAL LETTER MU) DOES NOT match "µ" (U+00B5, MICRO
SIGN)

as well as:

"µ" (U+00B5, MICRO SIGN) matches "Μ" (U+039C, GREEK CAPITAL LETTER MU)
"μ" (U+03BC, GREEK SMALL LETTER MU) matches "Μ" (U+039C, GREEK CAPITAL
LETTER MU)
but
"µ" (U+00B5, MICRO SIGN) DOES NOT match "μ" (U+03BC, GREEK SMALL LETTER MU)
(and viceversa).

(In reply to comment #6)
> I'm not sure if this is relevant and how Unicode defines it, but Greek has two
> variants of lower-case sigma (σ, ς) while there is only one upper-case Σ. If
> I were a Greek I would like to match bot of them.

It's the same situation here (again, testing PCRE 8.30):

- "σ" matches "Σ"
- "Σ" matches "σ"
- "ς" matches "Σ"
- "Σ" DOES NOT match "ς" (?)
And: "σ" and "ς" do not match each other.

This is solvable by case folding, although it opens many other implementation
problems.
For instance, does it make sense to have "ß" =~ /(S)(S)/i ? What should $1 and
$2 contain then?

I have the feeling this discussion is shifting towards PCRE implementing
Unicode regexps Level 2/3. :-)

Cheers,
Giuseppe D'Angelo

--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email

This message is part of the following thread:
	the complete thread tree sorted by date
	Petr Pisar at
	Zoltan Herczeg at