https://bugs.exim.org/show_bug.cgi?id=2131
Bug ID: 2131
Summary: Support Unicode Collation Algorithm Matching
Product: PCRE
Version: N/A
Hardware: All
OS: All
Status: NEW
Severity: wishlist
Priority: medium
Component: Code
Assignee: ph10@???
Reporter: mackyle@???
CC: pcre-dev@???
When using the Unicode Collation Algorithm (UCA) for matching,
the precomposed e-grave (U+00E8 UTF-8: \xC3\xA8) matches the
decomposed sequence e+grave (U+0065 U+0300 UTF-8: \x65\xCC\x80).
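To make the two encodings concrete, here is a small sketch in Python
(chosen only because its standard unicodedata module makes the check
short; it is not part of PCRE). The helper name canonically_equivalent
is made up for this illustration.

```python
import unicodedata

precomposed = "\u00e8"   # U+00E8, e-grave as a single code point
decomposed = "e\u0300"   # U+0065 U+0300, e followed by combining grave


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after full canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(precomposed.encode("utf-8"))   # b'\xc3\xa8'
print(decomposed.encode("utf-8"))    # b'e\xcc\x80'
print(precomposed == decomposed)     # False: the code points differ
print(canonically_equivalent(precomposed, decomposed))   # True
```

A matcher that compares code points (or bytes) sees two different
strings here, which is why the two forms do not match each other
without extra work.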
The International Components for Unicode (ICU) library supports
this style of matching using the "FCD" approach described in
Unicode Technical Note (UTN) #5. According to the beginning of
section "6 Sample Implementation" of UTN #5, ICU apparently
always matches this way.
Although the "P" in "PCRE" stands for Perl, and Perl does not
directly support this, it does provide NFC/NFKC/NFD/NFKD functions
that are readily available as part of the Perl core. For example:
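#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize;   # exports NFC, NFD, NFKC and NFKD by default
$\ = "\n";                # append a newline after every print
my $str1 = "\xE8";        # precomposed e-grave (U+00E8)
my $str2 = "e\x{300}";    # decomposed e+grave (U+0065 U+0300)
# The raw strings differ code point by code point, so this prints 0.
print "before NFC: ", ($str1 eq $str2) ? 1 : 0;
# NFC composes both strings to U+00E8, so this prints 1.
print "after NFC: ", (NFC($str1) eq NFC($str2)) ? 1 : 0;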
So it's not difficult to achieve this style of matching in Perl
if desired, but there does not seem to be such an option for PCRE.
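The same normalize-first workaround can be applied around any regex
engine that lacks such an option. The following is a rough sketch in
Python, using the standard re and unicodedata modules as a stand-in
for PCRE; the function name canonical_contains is hypothetical. It
only handles a literal search string, because normalizing a full
pattern naively would change the meaning of constructs such as
character classes.

```python
import re
import unicodedata


def canonical_contains(needle: str, haystack: str) -> bool:
    """Report whether a literal needle occurs in haystack, treating
    canonically equivalent sequences as equal.

    Both arguments are brought to NFD first so that precomposed and
    decomposed spellings end up as the same code point sequence.
    """
    nfd_needle = unicodedata.normalize("NFD", needle)
    nfd_haystack = unicodedata.normalize("NFD", haystack)
    # re.escape keeps the needle literal; this sketch does not try to
    # normalize regex syntax.
    return re.search(re.escape(nfd_needle), nfd_haystack) is not None


# A plain search misses the decomposed spelling ...
print(re.search("\u00e8", "tre\u0300s") is not None)
# ... while the normalizing wrapper finds it in either direction.
print(canonical_contains("\u00e8", "tre\u0300s"))
print(canonical_contains("e\u0300", "tr\u00e8s"))
```

Two limitations of this sketch: it normalizes the whole subject on
every call, which is the cost the FCD approach in UTN #5 is meant to
avoid, and a search for a bare "e" will also succeed inside e+grave
because no attention is paid to combining sequence boundaries.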
UTN #5 discusses approaches for doing this in an efficient manner.
UTN #5 points out that it's generally almost trivial to change an
NFC-providing routine into an FCC-providing one. FCC strings are
always FCD strings, and FCD strings can be used to efficiently
perform UCA matching.
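For readers unfamiliar with FCD, the check can be sketched as
follows. This is a minimal illustration in Python based on my reading
of the FCD definition in UTN #5 (a string is FCD if concatenating the
canonical decomposition of each character needs no reordering); it is
not ICU's implementation, and the names is_fcd and _lead_trail_ccc
are made up here.

```python
import unicodedata


def _lead_trail_ccc(ch: str) -> tuple[int, int]:
    """Return the canonical combining class of the first and last
    code point of ch's full canonical decomposition."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])


def is_fcd(text: str) -> bool:
    """Check whether decomposing text character by character would
    already yield canonically ordered output."""
    prev_trail = 0
    for ch in text:
        lead, trail = _lead_trail_ccc(ch)
        # A non-starter that sorts before the preceding trailing mark
        # would have to be reordered, so the string is not FCD.
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True


print(is_fcd("\u00e8"))          # precomposed e-grave
print(is_fcd("e\u0300"))         # decomposed e+grave
print(is_fcd("e\u0300\u0323"))   # grave (ccc 230) before dot below (ccc 220)
```

The point of the check is that it is a single pass with table
lookups only, so input that is already FCD (which includes NFC and
NFD text) needs no normalization before matching.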
The sample code provided in Unicode Standard Annex (UAX) #15 can be
modified in this way.
The utf8proc project provides a compact MIT-style licensed C library
that can perform NFC/NFKC/NFD/NFKD transformations and may be somewhat
more efficient than UAX #15's sample code since it's not "sample code."
If support for UCA matching should become available in PCRE at some
point, it would also be nice if it were available via the pcreposix.h
interface, perhaps via a new REG_UCA option bit.
Unicode Technical Standard #10: Unicode Collation Algorithm:
http://www.unicode.org/reports/tr10/
Unicode Standard Annex #15: Unicode Normalization Forms:
http://www.unicode.org/reports/tr15/
Unicode Technical Note #5: Canonical Equivalence in Applications:
http://www.unicode.org/notes/tn5/
utf8proc repository:
https://github.com/JuliaLang/utf8proc
utf8proc project page:
https://julialang.org/utf8proc/
--Kyle