[pcre-dev] [Bug 1894] In UTF8 Locale Russian Cyrillic [а-я]…

Author: admin
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1894] In UTF8 Locale Russian Cyrillic [а-я] range contains only 32 of 33 letters
https://bugs.exim.org/show_bug.cgi?id=1894

Petr Pisar <ppisar@???> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ppisar@???


--- Comment #1 from Petr Pisar <ppisar@???> ---
The [а-я] range does not cover every Cyrillic letter. It means the Unicode
characters in the range U+0430 to U+044F. As you correctly noted, ё (U+0451)
lies outside that range, so it should not match:

$ printf '/[а-я]*/8\nещё\n' | pcretest
PCRE version 8.39 2016-06-14

re> data> 0: \x{435}\x{449}
data>


If you extend the range up to ё, it will match:

$ printf '/[а-ё]*/8\nещё\n' | pcretest
PCRE version 8.39 2016-06-14

re> data> 0: \x{435}\x{449}\x{451}
data>
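
The same behaviour can be reproduced outside pcretest. The following is a minimal sketch using Python's standard `re` module rather than PCRE; the names `LOWER_RANGE`, `WITH_YO` and `code_points` are made up for illustration. It assumes the input uses the precomposed ё (U+0451), which is why the strings are written with explicit escapes.

```python
import re

# Hypothetical names, for illustration only.
LOWER_RANGE = re.compile("[\u0430-\u044f]*")        # [а-я]: U+0430..U+044F
WITH_YO = re.compile("[\u0430-\u044f\u0451]*")      # [а-яё]: the range plus U+0451


def code_points(text):
    """Return the code point of each character as a U+XXXX string."""
    return [f"U+{ord(ch):04X}" for ch in text]


word = "\u0435\u0449\u0451"  # "ещё"

print(code_points(word))                  # ['U+0435', 'U+0449', 'U+0451']
print(ord("\u044f") - ord("\u0430") + 1)  # 32 code points in [а-я]
print(LOWER_RANGE.match(word).group())    # stops before ё
print(WITH_YO.match(word).group())        # matches the whole word
```

Adding ё explicitly, as in `WITH_YO`, avoids also admitting U+0450, which the wider range [а-ё] includes.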


But there is a better way: instead of a Unicode code point range, you can use
a Unicode script name. This is preferable because code point ranges sometimes
contain characters from foreign scripts or unassigned code points:

$ printf '/\p{Cyrillic}*/8\nещё\n' | pcretest
PCRE version 8.39 2016-06-14

re> data> 0: \x{435}\x{449}\x{451}

data>
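
Python's standard `re` module has no `\p{...}` support, so the sketch below only approximates the idea of matching by character property instead of by code point range. It assumes that a character name beginning with "CYRILLIC" is a good enough stand-in for the Cyrillic script property, which is not exactly what PCRE's `\p{Cyrillic}` tests; `leading_cyrillic` is a hypothetical helper.

```python
import unicodedata


def leading_cyrillic(text):
    """Return the longest prefix whose characters have a CYRILLIC name.

    Assumption: the character name is used as a rough proxy for the
    Unicode script property; this is not equivalent to \\p{Cyrillic}.
    """
    prefix = []
    for ch in text:
        # name() returns "" here for characters without a name.
        if not unicodedata.name(ch, "").startswith("CYRILLIC"):
            break
        prefix.append(ch)
    return "".join(prefix)


word = "\u0435\u0449\u0451"  # "ещё"

print(leading_cyrillic(word) == word)          # True: ё is included
print(leading_cyrillic(word + "abc") == word)  # True: stops at Latin letters
```

Unlike a hand-written range, a property-based test does not depend on where ё happens to sit in the code chart.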
