[pcre-dev] [Bug 1179] New: Missed utf8 caseless matches

Author: Alex Coyte
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1179] New: Missed utf8 caseless matches
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1179
           Summary: Missed utf8 caseless matches
           Product: PCRE
           Version: 8.20
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: bug
          Priority: low
         Component: Code
        AssignedTo: ph10@???
        ReportedBy: acoyte@???
                CC: pcre-dev@???



It appears that a small number of code points are matched caselessly only when
they are followed by at least one more byte in the subject string. I suspect
the issue is related to the fact that the other-case form of the code point
requires fewer bytes to encode in UTF-8.

For example, LATIN SMALL LETTER A WITH STROKE (\x{2c65}) should match
caselessly against LATIN CAPITAL LETTER A WITH STROKE (\x{23a}). However, it
only seems to match when there is at least one more byte in the subject string:

PCRE version 8.20 2011-10-21

re> /ⱥ/8i

------------------------------------------------------------------
  0   7 Bra
  3  /i \x{2c65}
  7   7 Ket
 10     End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: caseless utf8
No first char
No need char

data> ⱥ

0: \x{2c65}
data> Ⱥ

No match
data> Ⱥ_

0: \x{23a}
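
The byte counts are consistent with that suspicion. As a quick check outside
PCRE (a Python sketch, not part of pcretest; the helper name utf8_len is made
up for illustration), the pattern character needs three bytes in UTF-8 while
the subject character that fails to match needs only two, and appending "_"
brings the subject up to three bytes:

```python
def utf8_len(text):
    """Return the number of bytes needed to encode text in UTF-8."""
    return len(text.encode("utf-8"))


pattern_char = "\u2c65"  # LATIN SMALL LETTER A WITH STROKE (in the pattern)
subject_char = "\u023a"  # LATIN CAPITAL LETTER A WITH STROKE (in the subject)

print(utf8_len(pattern_char))        # 3
print(utf8_len(subject_char))        # 2
print(utf8_len(subject_char + "_"))  # 3
```

This only shows that the failing subject is shorter than the pattern character;
it does not by itself confirm where in the matcher the length is checked.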

Interestingly, the match works when the character is in a character class:

re> /[ⱥ]/8i

------------------------------------------------------------------
  0  15 Bra
  3     [\x{2c65}\x{23a}]
 15  15 Ket
 18     End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: caseless utf8
No first char
No need char

data> ⱥ

0: \x{2c65}
data> Ⱥ

0: \x{23a}

This is happening on trunk (rev 765) as well as 8.20.

Other code points that seem to be affected include:
LATIN CAPITAL LETTER I WITH DOT ABOVE (\x{130})
LATIN SMALL LETTER DOTLESS I (\x{131})
LATIN CAPITAL LETTER SHARP S (\x{1e9e})
GREEK PROSGEGRAMMENI (\x{1fbe})
OHM SIGN (\x{2126})
KELVIN SIGN (\x{212a})
ANGSTROM SIGN (\x{212b})
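
All of the code points above fit the same pattern as \x{2c65}: the other-case
form is shorter in UTF-8. The Python sketch below illustrates this. The
other-case values in PAIRS are an assumption taken from the Unicode simple case
mappings, not read out of PCRE's own tables, and the function names are
hypothetical:

```python
# (code point, assumed other-case code point per Unicode simple case mapping)
PAIRS = [
    (0x2C65, 0x023A),  # LATIN SMALL LETTER A WITH STROKE
    (0x0130, 0x0069),  # LATIN CAPITAL LETTER I WITH DOT ABOVE
    (0x0131, 0x0049),  # LATIN SMALL LETTER DOTLESS I
    (0x1E9E, 0x00DF),  # LATIN CAPITAL LETTER SHARP S
    (0x1FBE, 0x0399),  # GREEK PROSGEGRAMMENI
    (0x2126, 0x03C9),  # OHM SIGN
    (0x212A, 0x006B),  # KELVIN SIGN
    (0x212B, 0x00E5),  # ANGSTROM SIGN
]


def utf8_len_cp(code_point):
    """Return the UTF-8 encoded length, in bytes, of one code point."""
    return len(chr(code_point).encode("utf-8"))


def shorter_other_case(pairs):
    """Return the pairs whose other-case form needs fewer UTF-8 bytes."""
    return [(cp, other) for cp, other in pairs
            if utf8_len_cp(other) < utf8_len_cp(cp)]


for cp, other in PAIRS:
    print("\\x{%x}: %d bytes, other case \\x{%x}: %d bytes"
          % (cp, utf8_len_cp(cp), other, utf8_len_cp(other)))
```

Under these assumed mappings, every pair in the list is returned by
shorter_other_case.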

