[pcre-dev] [Bug 1208] Case folding in PCRE

Author: Philip Hazel
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1208] Case folding in PCRE

http://bugs.exim.org/show_bug.cgi?id=1208




--- Comment #14 from Philip Hazel <ph10@???> 2012-02-11 11:44:10 ---
On Fri, 10 Feb 2012, Zoltan Herczeg wrote:

> I agree that we should start with single characters, and later we may try to
> address SS.


ß matching ss and SS could be handled by a special opcode like \R. Are
there any other cases like this? I suppose I should try to find time to
study the Unicode documents in detail.

> I would prefer one opcode for all such cases: OP_MULTICHAR IMM2
> pair, which points to a location in the multi-character table. The table would be
> an int array, which would be a concatenation of sub-lists. A sub-list starts
> with its length, followed by the characters which are equal in caseless form.
> This list could also contain \h and \v as well.
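
A minimal sketch of the table layout proposed here, written in Python for brevity rather than C. Everything in it is hypothetical: the function names, the choice of groups, and the idea that the opcode operand is an offset into the flat array are illustrations only, and whether σ and ς belong in one group is an open question in this thread.

```python
# Hypothetical sketch of the proposed table; no name here comes from PCRE.

def build_multichar_table(groups):
    """Concatenate groups into one flat int list.

    Each sub-list is stored as [length, cp1, cp2, ...]. Returns the flat
    table and a dict mapping every code point to its sub-list's offset.
    """
    table = []
    offset_of = {}
    for group in groups:
        offset = len(table)
        table.append(len(group))
        table.extend(group)
        for cp in group:
            offset_of[cp] = offset
    return table, offset_of


def matches_at(table, offset, cp):
    """Match-time check: is cp in the sub-list that starts at offset?"""
    length = table[offset]
    return cp in table[offset + 1 : offset + 1 + length]


# Assumed groupings, for illustration only.
groups = [
    [0x03A3, 0x03C3, 0x03C2],  # Σ σ ς
    [0x00B5, 0x039C, 0x03BC],  # µ Μ μ
]
table, offset_of = build_multichar_table(groups)

# "Compile time": look the pattern character up once and keep only the offset.
operand = offset_of[ord("σ")]
print(table, operand, matches_at(table, operand, ord("ς")))
```

The search for the character happens once, when the table offset is computed; at match time only a short scan of one sub-list remains.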


That's good because it does the work of searching for the character at
compile time. I assume that by "points to a location in" you mean
"contains an offset in".

> (σ, ς) -> Σ question: Does a caseless σ matches to ς?


As I understand it, ς is a "final sigma", appearing only at the ends
of words, whereas σ appears in other positions. So I would say the
answer to your question is "no", but I am not an expert.

On Fri, 10 Feb 2012, Petr Pisar wrote:

> (In reply to comment #9)
> > (σ, ς) -> Σ question: Does a caseless σ matches to ς?
> >
> I don't know. I think case of search pattern is insignificant in caseless
> matching mode, so if Σ matches {σ, ς}, and {σ, ς} match Σ, then {σ, ς}
> should match {σ, ς}. I feel the matching should be equivalence relation.


Perhaps we should find a Greek language expert to ask? (I don't know if
the Unicode documents cover this; they might.)
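
For comparison, here is a small sketch of what an equivalence-relation view looks like when built on Python's str.casefold(), which applies Unicode case folding. It says nothing about Greek usage or about what PCRE should do; the helper names are made up for the example.

```python
from collections import defaultdict


def caseless_classes(chars):
    """Group characters whose str.casefold() results are identical."""
    classes = defaultdict(set)
    for ch in chars:
        classes[ch.casefold()].add(ch)
    return dict(classes)


def caseless_equal(a, b):
    """Comparing folded forms is reflexive, symmetric and transitive."""
    return a.casefold() == b.casefold()


classes = caseless_classes("Σσς")
print(classes)
print(caseless_equal("σ", "ς"), caseless_equal("ς", "σ"))
```

Because both sides are reduced to the same folded form before comparison, all three sigma characters end up in a single class under this approach.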

On Fri, 10 Feb 2012, Giuseppe D'Angelo wrote:

> For the argument's sake, let's put special regexp symbols out of the equation:
> if ß matches "SS" case insensitively, then "SS" should match ß as well. In
> other words, it's logical to expect that both
> "straße" =~ /STRASSE/i
> and
> "STRASSE" =~ /straße/i
> match.


When dealing with German text, perhaps yes. For non-German text, I would
not expect

"maßive" =~ /MASSIVE/i

to match. (This feels like just the kind of thing that might lead to
security exposures - probably most people in the world are not even
aware of the existence of the ß character.)

From what I read of the Unicode documents (and I admit I have only
skimmed them so far) I seem to remember that they do not suggest that

"straße" =~ /STRASSE/i

should match; just the other way round.
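
As a point of comparison only, the sketch below shows how Python behaves on these strings. It is not a reading of the Unicode documents: str.casefold() applies full case folding, under which ß folds to "ss", so both directions compare equal, while a plain lower-casing comparison does not match at all. The function names are invented for the example.

```python
def lower_equal(a, b):
    """Compare after str.lower(); ß stays ß, so it never equals 'ss'."""
    return a.lower() == b.lower()


def folded_equal(a, b):
    """Compare after str.casefold(); ß is folded to 'ss'."""
    return a.casefold() == b.casefold()


pairs = [("straße", "STRASSE"), ("STRASSE", "straße"), ("maßive", "MASSIVE")]
for subject, pattern in pairs:
    print(subject, pattern, lower_equal(subject, pattern),
          folded_equal(subject, pattern))
```

Note that the folded comparison also equates "maßive" with "MASSIVE", which is exactly the non-German case raised above.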

> > I don't know how Perl handles this. I guess one of us should check.
>
> It doesn't according to the perlunicode page (see RL2.1)
>
> http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level


So if PCRE does any of this, it is going beyond Perl. Not that that
would be unprecedented.

> Basically the ones with "F" status listed here:
> http://www.unicode.org/Public/6.1.0/ucd/CaseFolding.txt
>
> "µ" (U+00B5, MICRO SIGN) matches "Μ" (U+039C, GREEK CAPITAL LETTER MU)
> but
> "Μ" (U+039C, GREEK CAPITAL LETTER MU) DOES NOT match "µ" (U+00B5, MICRO
> SIGN)


This must be because the Unicode table UnicodeData.txt has that data in
it. Indeed, here it is:

00B5;MICRO SIGN;Ll;0;L;<compat> 03BC;;;;N;;;039C;;039C
039C;GREEK CAPITAL LETTER MU;Lu;0;L;;;;;N;;;;03BC;

I see that U+039C is not mentioned in the SpecialCasing.txt file.
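
The asymmetry can be read directly off the two records quoted above. A minimal sketch that splits them on ";" and picks out the simple uppercase, lowercase and titlecase mapping fields (fields 12 to 14, counting from 0); the function and dictionary key names are made up for the example.

```python
UNICODE_DATA = """\
00B5;MICRO SIGN;Ll;0;L;<compat> 03BC;;;;N;;;039C;;039C
039C;GREEK CAPITAL LETTER MU;Lu;0;L;;;;;N;;;;03BC;
"""


def parse_case_fields(text):
    """Return {code point: name and simple case mappings} per record."""
    records = {}
    for line in text.splitlines():
        fields = line.split(";")
        code = int(fields[0], 16)
        # Fields 12, 13, 14: simple uppercase, lowercase, titlecase mapping.
        upper, lower, title = (
            int(f, 16) if f else None for f in fields[12:15]
        )
        records[code] = {
            "name": fields[1],
            "upper": upper,
            "lower": lower,
            "title": title,
        }
    return records


records = parse_case_fields(UNICODE_DATA)
for code, rec in records.items():
    print(f"U+{code:04X}", rec)
```

U+00B5 has an uppercase mapping to U+039C, but U+039C has a lowercase mapping to U+03BC and nothing pointing back to U+00B5, hence the one-way behaviour.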

> "µ" (U+00B5, MICRO SIGN) matches "Μ" (U+039C, GREEK CAPITAL LETTER MU)
> "μ" (U+03BC, GREEK SMALL LETTER MU) matches "Μ" (U+039C, GREEK CAPITAL
> LETTER MU)
> but
> "µ" (U+00B5, MICRO SIGN) DOES NOT match "μ" (U+03BC, GREEK SMALL LETTER MU)
> (and viceversa).


It's a mess, isn't it? Personally, I don't think there should be an
upper case version of U+00B5 micro sign, because it isn't a letter (just
like there isn't an upper case version of a dollar sign).

> It's the same situation here (again, testing PCRE 8.30):
>
> - "σ" matches "Σ"
> - "Σ" matches "σ"
> - "ς" matches "Σ"
> - "Σ" DOES NOT match "ς" (?)
> And: "σ" and "ς" do not match each other.


Again, these are all a consequence of what is defined by
UnicodeData.txt.
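
A rough model of why a single "other case" per character gives these results. This is a hypothetical sketch, not PCRE's code: it uses Python's str.upper() and str.lower() as a stand-in for the mappings in UnicodeData.txt, and it reads "A matches B" in the list above as pattern A against subject B, which is an assumption.

```python
def other_case(ch):
    """Return the one-character upper or lower case counterpart of ch,
    or ch itself if Python's str.upper()/str.lower() give none."""
    up = ch.upper()
    if up != ch and len(up) == 1:
        return up
    low = ch.lower()
    if low != ch and len(low) == 1:
        return low
    return ch


def one_way_match(pattern_ch, subject_ch):
    """The subject must be the pattern character or its single other case."""
    return subject_ch == pattern_ch or subject_ch == other_case(pattern_ch)


for p, s in [("σ", "Σ"), ("Σ", "σ"), ("ς", "Σ"), ("Σ", "ς"), ("σ", "ς")]:
    print(p, s, one_way_match(p, s))
```

Both σ and ς map up to Σ, but Σ can map down to only one of them, so a scheme with one "other case" per character cannot be symmetric here.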

> This is solvable by case folding, although it opens many other implementation
> problems.
> For instance, does it make sense to have "ß" =~ /(S)(S)/i ? What should $1 and
> $2 contain then?


Quite!

> I have the feeling this discussion is shifting towards PCRE implementing
> Unicode regexps Level 2/3. :-)


Or even level 4? :-)

On Fri, 10 Feb 2012, Zoltan Herczeg wrote:

> > "straße" =~ /STRASSE/i
>
> Hm, this can be ambiguous in some cases:
>
> "SS" =~ /[ßS]/i
>
> What is the supposed result and why does it take precedence?


Good question! In the case of \R, the longer string takes precedence. In
other words, it only matches \r on its own when \r is not followed by
\n. In this example, I have no idea what should happen.
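
The \R precedence rule can be illustrated with an ordinary alternation. Python's re module has no \R, so the sketch below emulates a simplified version of it (only \r\n, \r and \n; the real \R covers more characters) and shows that the order of alternatives is what gives the longer string precedence.

```python
import re

# Longer alternative first: \r\n is consumed as one unit.
NEWLINE = re.compile(r"\r\n|\r|\n")

# Shorter alternatives first: \r\n is split into two separate matches.
SHORT_FIRST = re.compile(r"\r|\n|\r\n")

text = "a\r\nb\rc\n"
print(NEWLINE.findall(text))
print(SHORT_FIRST.findall(text))
```

For "SS" against /[ßS]/i there is no such agreed ordering, which is the ambiguity raised above.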

On Fri, 10 Feb 2012, Giuseppe D'Angelo wrote:

> --- Comment #13 from Giuseppe D'Angelo <dangelog@???> 2012-02-10 23:16:01 ---
> For discussion, some additional material from Perl 5.14:
> http://98.245.80.27/tcpc/OSCON2011


Help! It's a horribly complicated minefield.

I am not at all sure I wish to venture any further than Simple case
folding, with perhaps the addition of some features of
SpecialCasing.txt.

I think we should consider this issue very carefully for a while before
even trying to implement anything. If/when there is implementation, I
would like to avoid having to maintain local tables if possible, and use
only tables derived automatically from the Unicode files, in the same
way that the table in pcre_ucd.c and some of the tables in pcre_tables.c
are currently created.
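
To make the "derived automatically" idea concrete, here is a minimal sketch of a generator that reads lines in the CaseFolding.txt format, keeps only the simple foldings (status C and S), and emits a C array. It is not how pcre_ucd.c is actually produced; the excerpt is a few lines reproduced from memory in that file's format, and the function names and the C table name are hypothetical.

```python
CASE_FOLDING_EXCERPT = """\
# Format: <code>; <status>; <mapping>; # <name>
00B5; C; 03BC; # MICRO SIGN
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA
03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA
"""


def simple_foldings(text):
    """Return (code, folded) pairs for status C and S entries only."""
    pairs = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        if status in ("C", "S"):  # F (full) and T (Turkic) are skipped
            pairs.append((int(code, 16), int(mapping, 16)))
    return pairs


def emit_c_table(pairs, name="fold_table"):
    """Render the pairs as C source; the table name is hypothetical."""
    rows = ",\n".join(
        f"  {{ 0x{src:04x}, 0x{dst:04x} }}" for src, dst in pairs
    )
    return f"static const unsigned int {name}[][2] = {{\n{rows}\n}};\n"


pairs = simple_foldings(CASE_FOLDING_EXCERPT)
c_source = emit_c_table(pairs)
print(c_source)
```

The ß entry has status F, so a generator restricted to simple folding drops it without any hand-maintained exception list.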

As well as waiting for 8.30 to settle down before starting anything big,
there are in fact several other issues "in the queue" for thinking about
(Zoltán, I am going to write to you about them soon). That is another
reason for not rushing ahead immediately.

For my part, I will do my best to study the Unicode documents and the
Perl documents to try to understand better what they are getting at.

Philip

