[pcre-dev] [Bug 2131] Support Unicode Collation Algorithm Ma…

Author: admin
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 2131] Support Unicode Collation Algorithm Matching
https://bugs.exim.org/show_bug.cgi?id=2131

Philip Hazel <ph10@???> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |WONTFIX
             Status|NEW                         |RESOLVED


--- Comment #1 from Philip Hazel <ph10@???> ---
There are philosophical and practical problems here. I have always argued that
PCRE is a processor of strings of characters, with very little interpretation
of how successive characters might interact with each other (minor exceptions
are CRLF and \X sequences). I also feel that you should be able to look for
U+00E8 and find *only* the U+00E8 character, not the U+0065/U+0300 pair. After
all, you may actually be interested in searching for these kinds of differences
of usage in a text.
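
The distinction drawn above can be seen in a small sketch. It uses Python's built-in re module purely as a stand-in for a code-point-based matcher (it is not PCRE2), and the sample text is invented for illustration:

```python
import re

# Two encodings of the same visible letter (sample data, not from the bug report).
precomposed = "\u00e8"   # U+00E8 LATIN SMALL LETTER E WITH GRAVE
decomposed = "e\u0300"   # U+0065 followed by U+0300 COMBINING GRAVE ACCENT

# The text contains one occurrence of each form.
text = "caf\u00e8 / cafe\u0300"

# A code-point-based search for one form does not find the other.
hits_precomposed = [m.start() for m in re.finditer(precomposed, text)]
hits_decomposed = [m.start() for m in re.finditer(decomposed, text)]

print(hits_precomposed)  # offsets where the single code point U+00E8 occurs
print(hits_decomposed)   # offsets where the U+0065/U+0300 pair occurs
```

Each search reports exactly one offset, so a caller auditing a text for mixed usage can tell the two forms apart.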

The practical issue is that when one regex item might match a variable number
of code points, the length of the matched string is unknown. This makes
lookbehinds impossible, at least in the way that they are implemented in PCRE2,
which is to move back N characters and then match forwards.
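
As a rough illustration of the same constraint, Python's re module (again a stand-in, not PCRE2, and with its own rules) also refuses a lookbehind whose width is not fixed. The patterns below are hypothetical examples: one where the lookbehind item could span one or two code points, and one where it is always a single code point:

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Lookbehind that may cover one code point (e) or two (e + combining grave):
# the engine cannot know how far to step back, so compilation fails.
variable_ok = compiles("(?<=e\u0300?)x")

# Lookbehind that always covers exactly one code point: accepted.
fixed_ok = compiles("(?<=\u00e8)x")

print(variable_ok, fixed_ok)
```

If a single regex item were allowed to match either the precomposed or the decomposed form, every lookbehind containing it would fall into the first category.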

If an application is dealing with texts that have a mixture of ways of
representing (for example) accented characters, and wants a simple way of
searching for them, it can normalize the text (and possibly also the pattern)
beforehand. It looks as if that's what your Perl examples do, and as there's
already a library available from the utf8proc project it should be
straightforward for other applications. I can't see any advantage of packaging
this inside PCRE2. I'm going to mark this issue "won't fix".
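
A minimal sketch of the normalize-first approach follows. It assumes Python's standard unicodedata module in place of utf8proc, and the helper find_normalized is a hypothetical name introduced only for this example. Normalizing a pattern is only safe when it is plain literal text, as it is here; a pattern containing regex syntax would need more care:

```python
import re
import unicodedata

def nfc(s: str) -> str:
    """Normalize a string to Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", s)

def find_normalized(pattern: str, text: str) -> list[int]:
    """Normalize both pattern and text, then search.

    The returned offsets refer to the normalized text, not the original.
    """
    return [m.start() for m in re.finditer(nfc(pattern), nfc(text))]

# Sample text mixing the precomposed and decomposed forms (invented data).
text = "caf\u00e8 / cafe\u0300"

# The pattern is given in decomposed form; after NFC it is the single U+00E8.
offsets = find_normalized("e\u0300", text)
print(offsets)
```

After normalization both occurrences are found by one ordinary search, and the matcher itself still treats the input as a plain sequence of code points.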
