[pcre-dev] [Bug 2131] New: Support Unicode Collation Algorithm Matching

Author: admin
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 2131] New: Support Unicode Collation Algorithm Matching

https://bugs.exim.org/show_bug.cgi?id=2131

            Bug ID: 2131
           Summary: Support Unicode Collation Algorithm Matching
           Product: PCRE
           Version: N/A
          Hardware: All
                OS: All
            Status: NEW
          Severity: wishlist
          Priority: medium
         Component: Code
          Assignee: ph10@???
          Reporter: mackyle@???
                CC: pcre-dev@???

When using the Unicode Collation Algorithm (UCA) for matching,
the precomposed e-grave (U+00E8 UTF8: \xC3\xA8) matches the
decomposed sequence e+grave (U+0065 U+0300 UTF8: \x65\xCC\x80).
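
As a quick check of the code points and byte sequences quoted above, here is
a small Python sketch. Python and the variable names are used for
illustration only and are not part of the report. It also shows that a regex
engine without canonical-equivalence support (Python's `re` in this case)
treats the two forms as different strings.

```python
import re

precomposed = "\u00e8"   # e-grave as a single code point (U+00E8)
decomposed = "e\u0300"   # 'e' followed by COMBINING GRAVE ACCENT (U+0300)

# UTF-8 encodings of the two forms
precomposed_utf8 = precomposed.encode("utf-8")
decomposed_utf8 = decomposed.encode("utf-8")

# A plain regex match compares code points, so the precomposed pattern
# does not match the decomposed subject.
plain_match = re.fullmatch(precomposed, decomposed) is not None
```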

The International Components for Unicode (ICU) library supports
this style of matching using the "FCD" approach described in
Unicode Technical Note (UTN) #5. According to the mention at the
beginning of section "6 Sample Implementation" of that note, ICU
apparently always matches this way.

Although the "P" in "PCRE" stands for Perl, and Perl does not
directly support this, it does provide NFC/NFKC/NFD/NFKD functions
that are readily available as part of the Perl core. For example:

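    #!/usr/bin/perl
    use strict;
    use warnings;
    use Unicode::Normalize;

    $\ = "\n"; # append a newline to every print
    my $str1 = "\xE8"; # precomposed e-grave
    my $str2 = "e\x{300}"; # decomposed e+grave
    # the raw strings differ, so this prints 0
    print "before NFC: ", ($str1 eq $str2) ? 1 : 0;
    # after normalizing both to NFC they are equal, so this prints 1
    print "after NFC: ", (NFC($str1) eq NFC($str2)) ? 1 : 0;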

So it's not difficult to achieve this style of matching in Perl
if desired, but there does not seem to be such an option for PCRE.
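
The same normalize-then-match idea can be sketched outside Perl. The Python
version below is a minimal illustration, not a proposal for how PCRE should
do it: `nfc_search` is a hypothetical helper name, and it assumes the
pattern contains only literal text or constructs whose meaning NFC does not
change (a character class such as `[e\u0300]` would be altered, for
instance).

```python
import re
import unicodedata

def nfc_search(pattern, subject, flags=0):
    """Search after normalizing both pattern and subject to NFC.

    Hypothetical helper for illustration only. Match offsets refer to
    the normalized subject, not to the original string.
    """
    norm_pattern = unicodedata.normalize("NFC", pattern)
    norm_subject = unicodedata.normalize("NFC", subject)
    return re.search(norm_pattern, norm_subject, flags)

# Precomposed pattern against a subject that uses the decomposed form.
subject = "cre\u0300me"
without_nfc = re.search("\u00e8", subject)
with_nfc = nfc_search("\u00e8", subject)
```

Normalizing the whole subject up front costs a copy and shifts match
offsets, which is part of why UTN #5 looks at cheaper approaches.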

UTN #5 discusses approaches for doing this in an efficient manner.

UTN #5 points out that it is generally almost trivial to change an
NFC-providing routine into an FCC-providing one. FCC strings are
always FCD strings, and FCD strings can be used to perform UCA
matching efficiently.
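
To make the FCD property concrete, here is a minimal Python sketch of an
FCD check. It assumes the usual reading of UTN #5: a string is FCD when the
concatenation of each character's full canonical decomposition is already in
canonical order. The function name `is_fcd` is hypothetical, and a
production check would use precomputed lead/trail combining-class tables
rather than decomposing every character.

```python
import unicodedata

def is_fcd(text):
    """Return True if decomposing each character in place, with no
    reordering across characters, yields canonically ordered text."""
    prev_ccc = 0
    for ch in text:
        # NFD of a single character gives its full canonical decomposition.
        for piece in unicodedata.normalize("NFD", ch):
            ccc = unicodedata.combining(piece)
            # Canonical order is violated when a non-starter follows a
            # character with a higher combining class.
            if ccc != 0 and prev_ccc > ccc:
                return False
            prev_ccc = ccc
    return True
```

Both the precomposed and decomposed e-grave pass this check, while a
sequence such as `"a\u0301\u0323"` (acute, class 230, followed by dot
below, class 220) does not.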

The sample code provided in Unicode Standard Annex (UAX) #15 can be
modified in this way.

The utf8proc project provides a compact MIT-style licensed C library
that can perform NFC/NFKC/NFD/NFKD transformations and may be somewhat
more efficient than UAX #15's sample code since it's not "sample code."

If support for UCA matching should become available in PCRE at some
point, it would also be nice if it were available via the pcreposix.h
interface, perhaps via a new REG_UCA option bit.

Unicode Technical Standard #10: Unicode Collation Algorithm:

    http://www.unicode.org/reports/tr10/

Unicode Standard Annex #15: Unicode Normalization Forms:

    http://www.unicode.org/reports/tr15/

Unicode Technical Note #5: Canonical Equivalence in Applications:

    http://www.unicode.org/notes/tn5/

utf8proc repository:

    https://github.com/JuliaLang/utf8proc

utf8proc project page:

    https://julialang.org/utf8proc/

--Kyle

