[pcre-dev] [Bug 1208] Case folding in PCRE

Author: Philip Hazel
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 1208] Case folding in PCRE

http://bugs.exim.org/show_bug.cgi?id=1208

--- Comment #7 from Philip Hazel <ph10@???> 2012-02-10 12:23:41 ---
On Thu, 9 Feb 2012, Zoltan Herczeg wrote:

> We already have such opcode: \R which matches to (?:\r|\n|\r\n) although it
> cannot be used inside [] ranges.

... and also \R cannot be used in lookbehinds. Let us suppose somebody
writes a fragment of a pattern such as (?<=mass) which at the moment is
ok - it looks behind 4 characters. If we allow ss to match ß (a single
character) we will either have to disallow ss in lookbehinds, or
completely re-implement the way they work. And of course the same thing
the other way round (ß in pattern matching ss in the subject).
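To make the lookbehind problem concrete, here is a small sketch. It uses
Python's re module purely as a stand-in, on the assumption that its
fixed-width lookbehind rule is close enough to illustrate the point; it
is not PCRE, and the helper name lookbehind_compiles is made up for this
example.

```python
import re

# A fixed-length lookbehind compiles: the engine steps back exactly 4 characters.
fixed = re.compile(r"(?<=mass)e")


def lookbehind_compiles(pattern):
    """Return True if Python's re accepts the pattern, False if it rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# "ss" is two characters but U+00DF (sharp s) is one, so a lookbehind that had
# to accept either would have two possible lengths. Python's re rejects that.
variable_ok = lookbehind_compiles("(?<=ma(?:ss|\u00df))e")
```

If "ss" in a pattern silently meant "ss or ß", every lookbehind
containing it would turn into the second, variable-length form.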

I see that Perl also disallows \R in lookbehinds. However, although
it does do some other special things, it is inconsistent. My version of
Perl 5.12.4 matches U+1f88 caselessly to U+1f80 (one character) or to
U+1f00 U+03b9 (two characters), but if you include U+1f88 in a
lookbehind, it does not complain, but matches only the single character.
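The two ways of matching U+1f88 can be seen outside Perl as well. The
sketch below uses Python's str.lower() and str.casefold() only as a
convenient way to look at the Unicode data; it says nothing about how
Perl or PCRE implement the match.

```python
# U+1F88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
upper = "\u1f88"

# One-to-one mapping: the result is a single character.
simple = upper.lower()

# Full case folding: the result may expand to several characters.
full = upper.casefold()

simple_codepoints = [f"U+{ord(c):04X}" for c in simple]
full_codepoints = [f"U+{ord(c):04X}" for c in full]
```

The single-character form is the one a fixed-length lookbehind can cope
with; the two-character form is the one that causes the trouble.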

> If the number of such letters are relatively low (<5 for example), we
> might able to introduce special opcodes for them (Is there any other
> besides Eszett?).

If we start including accented characters, the list could get quite
long. I have not studied the Unicode document

http://www.unicode.org/reports/tr18/

in great detail, but one of the things it says is this:

At Level 1, caseless matches do not need to handle cases where one
character matches against two.

I think that extending PCRE beyond level 1 would be a lot of work. It
would certainly affect performance (as the Unicode document does in fact
point out).
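As a rough illustration of the difference between Level 1 and anything
beyond it, here is a sketch in Python. The functions simple_fold,
level1_equal and full_equal are hypothetical names for this example, and
simple_fold only approximates Unicode simple case folding by discarding
any mapping that expands to more than one character.

```python
def simple_fold(ch):
    """Approximate a one-to-one fold: never map one character to several."""
    folded = ch.casefold()
    if len(folded) == 1:
        return folded
    lowered = ch.lower()
    return lowered if len(lowered) == 1 else ch


def level1_equal(a, b):
    """Caseless comparison, character by character (Level 1 style)."""
    if len(a) != len(b):
        return False
    return all(simple_fold(x) == simple_fold(y) for x, y in zip(a, b))


def full_equal(a, b):
    """Caseless comparison using full case folding, which may change lengths."""
    return a.casefold() == b.casefold()
```

With these definitions, upper and lower case mu compare equal either
way, but "MASSE" and "Maße" compare equal only under full_equal, because
the two strings have different lengths before folding.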

> The other opcodes would be implemented as character ranges. Similar to
> \h and \v but the list on the list is long and would require manual
> maintenance... Anyway if we decide to add multiple cases thing \h and
> \v should become part of that new system.

Yes, that is certainly a good point, depending on how this is
implemented.

> By the way, this would surely be unexpected for an average PCRE user:
> - "Μ" (uppercase mu) matches "μ" (lowercase mu);

And likewise for "ss" matching U+00DF even without case folding.

> Or uppercase mu has a different codepoint like different beta characters?

There is an upper case μ; it is Μ (U+039C). PCRE already handles this
case just fine.

I suppose that the first thing that might be done is to create a list of
all the special cases to see how many there are, before deciding what,
if anything, to do. However, I fear that the list will in fact turn out
to be rather large because of all the character+accent combinations such
as the U+1f88 example quoted above. Ah. Silly me. No need to do this;
the Unicode people have already done it:

http://www.unicode.org/Public/6.1.0/ucd/SpecialCasing.txt

This file, which is not small, is concerned with special case mappings
(lower, title and upper case) rather than with case folding as such. It
does not specify that ß should match "ss", for example, just that, when
uppercased, it becomes "SS". The file is sufficiently large that it
would need to be processed into some kind of efficient indexed format
before being used, and I would certainly want as much of the work as
possible to be done at compile time. So yes, new opcodes would have to
be invented.
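A minimal sketch of the preprocessing step, in Python rather than C for
brevity. It assumes the layout "code; lower; title; upper; [condition;]
# name" and works on a few sample lines embedded in the script instead
of the real file. The names SAMPLE, to_text and parse_special_casing are
made up for this example, and skipping the conditional entries is a
simplification, not a recommendation.

```python
SAMPLE = """\
# code; lower; title; upper; [condition;] # name
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
FB00; FB00; 0046 0066; 0046 0046; # LATIN SMALL LIGATURE FF
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
"""


def to_text(field):
    """Turn a field such as '0053 0053' into the string it denotes."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_special_casing(text):
    """Build a lookup table keyed by character from SpecialCasing-style lines."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if fields and fields[-1] == "":
            fields.pop()  # the trailing ';' leaves one empty field
        if len(fields) < 4:
            continue  # not a mapping line
        if len(fields) > 4:
            continue  # conditional (context or language dependent): skipped here
        code, lower, title, upper = fields
        table[chr(int(code, 16))] = {
            "lower": to_text(lower),
            "title": to_text(title),
            "upper": to_text(upper),
        }
    return table


table = parse_special_casing(SAMPLE)
```

In PCRE the equivalent table would presumably be generated offline and
built into the library, so that the pattern compiler only has to do a
lookup.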

I would still be unhappy about having to disallow ß in lookbehinds,
even if only when case folding.
