http://bugs.exim.org/show_bug.cgi?id=1179
Summary: Missed utf8 caseless matches
Product: PCRE
Version: 8.20
Platform: Other
OS/Version: Linux
Status: NEW
Severity: bug
Priority: low
Component: Code
AssignedTo: ph10@???
ReportedBy: acoyte@???
CC: pcre-dev@???
A small number of code points appear to match caselessly only when they are
followed by at least one more byte in the subject string. I suspect the issue
is related to the fact that the other-case form of the code point requires
fewer bytes to encode in UTF-8.
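To make the size difference concrete, here is a minimal Python sketch (not
part of PCRE) that prints the UTF-8 encoding of the two forms used in the
example that follows. The helper name utf8_hex is illustrative only.

```python
def utf8_hex(cp):
    """Return the UTF-8 bytes of code point cp as a spaced hex string."""
    return " ".join(f"{b:02x}" for b in chr(cp).encode("utf-8"))

small = 0x2C65    # LATIN SMALL LETTER A WITH STROKE
capital = 0x023A  # LATIN CAPITAL LETTER A WITH STROKE

for cp in (small, capital):
    n_bytes = len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {n_bytes} bytes ({utf8_hex(cp)})")
```

The small letter needs one more byte than the capital, which is the asymmetry
suspected above.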
For example, LATIN SMALL LETTER A WITH STROKE (\x{2c65}) should match
caselessly against LATIN CAPITAL LETTER A WITH STROKE (\x{23a}). However, it
only seems to match when there is at least one more byte in the subject string:
PCRE version 8.20 2011-10-21
re> /ⱥ/8i
------------------------------------------------------------------
0 7 Bra
3 /i \x{2c65}
7 7 Ket
10 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: caseless utf8
No first char
No need char
data> ⱥ
0: \x{2c65}
data> Ⱥ
No match
data> Ⱥ_
0: \x{23a}
Interestingly, the match succeeds when the character is in a character class:
re> /[ⱥ]/8i
------------------------------------------------------------------
0 15 Bra
3 [\x{2c65}\x{23a}]
15 15 Ket
18 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: caseless utf8
No first char
No need char
data> ⱥ
0: \x{2c65}
data> Ⱥ
0: \x{23a}
This is happening on trunk (rev 765) as well as 8.20.
Other code points which seem to be affected include:
LATIN CAPITAL LETTER I WITH DOT ABOVE (\x{130})
LATIN SMALL LETTER DOTLESS I (\x{131})
LATIN CAPITAL LETTER SHARP S (\x{1e9e})
GREEK PROSGEGRAMMENI (\x{1fbe})
OHM SIGN (\x{2126})
KELVIN SIGN (\x{212a})
ANGSTROM SIGN (\x{212b})
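The byte-length suspicion can be checked against this list with a short Python
sketch. The other-case partners below follow Unicode's simple case mappings
and are an assumption here; PCRE's internal tables may pair them differently.
The names PAIRS, utf8_len and shrinks are illustrative only.

```python
# (code point, assumed other-case partner) pairs. Partners follow Unicode
# simple case mappings; they are NOT read from PCRE's own tables.
PAIRS = [
    (0x2C65, 0x023A),  # LATIN SMALL LETTER A WITH STROKE
    (0x0130, 0x0069),  # LATIN CAPITAL LETTER I WITH DOT ABOVE
    (0x0131, 0x0049),  # LATIN SMALL LETTER DOTLESS I
    (0x1E9E, 0x00DF),  # LATIN CAPITAL LETTER SHARP S
    (0x1FBE, 0x0399),  # GREEK PROSGEGRAMMENI
    (0x2126, 0x03C9),  # OHM SIGN
    (0x212A, 0x006B),  # KELVIN SIGN
    (0x212B, 0x00E5),  # ANGSTROM SIGN
]

def utf8_len(cp):
    """Number of bytes needed to encode code point cp in UTF-8."""
    return len(chr(cp).encode("utf-8"))

def shrinks(cp, other):
    """True if the other-case partner encodes in fewer bytes than cp."""
    return utf8_len(other) < utf8_len(cp)

for cp, other in PAIRS:
    print(f"U+{cp:04X} ({utf8_len(cp)} bytes) <-> "
          f"U+{other:04X} ({utf8_len(other)} bytes) "
          f"shorter partner: {shrinks(cp, other)}")
```

A shorter partner for every pair would be consistent with the suspicion above,
though it does not by itself locate the bug.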