Re: [pcre-dev] Proposal for a new API for PCRE

Author: jcd
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Proposal for a new API for PCRE

>Yes, it was this list. The solution should be to create a new "Turkish
>dotted i" character, distinct from the ASCII i, that can then have its
>own behaviour. Whether this will ever happen is, of course, another
>matter.


Very unlikely as I see it. It would also require an uppercase dotless I
distinct from the ASCII I.
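
To make the problem concrete, here is a small sketch using Python's
built-in string methods, which apply the default (locale-independent)
Unicode case mappings. The helper name `survives_roundtrip` is made up
for this illustration; nothing here is PCRE code.

```python
# Turkish has two letter pairs: i/İ (dotted) and ı/I (dotless).
# Unicode encodes İ and ı separately, but the dotted lowercase and the
# dotless uppercase are the plain ASCII letters i and I.
DOTLESS_LOWER = "\u0131"  # ı LATIN SMALL LETTER DOTLESS I
DOTTED_UPPER = "\u0130"   # İ LATIN CAPITAL LETTER I WITH DOT ABOVE


def survives_roundtrip(ch: str) -> bool:
    """Return True if upper-casing then lower-casing gives back ch."""
    return ch.upper().lower() == ch


# Default mappings pair ASCII i with ASCII I, which is wrong for Turkish:
print("i".upper() == "I")                  # True: Turkish would expect İ
print("I".lower() == "i")                  # True: Turkish would expect ı
print(DOTLESS_LOWER.upper() == "I")        # True
print(survives_roundtrip(DOTLESS_LOWER))   # False: ı -> I -> i
print(DOTTED_UPPER.lower() == "i\u0307")   # True: i + combining dot above
```

Without either locale-specific rules or the two extra characters
discussed above, a caseless match has no way to tell which pairing the
text intends.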

> > Another issue is with character classes ranges. You would think
> > that [a-k] and [x-z] are well defined for latin scripts but that's
> > not the case.
>
>They are certainly well defined, being equivalent to [\x61-\x6b] and
>[\x78-\x7a], that is, related to code order, not cultural sorting order.


Sorry, bad wording on my side. I should have said "have the same
meaning for all Latin scripts from a user's point of view".
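
A short sketch of the gap between code order and what a user might
expect, using Python's `re` module as a stand-in for any engine that
defines ranges by code point. The Lithuanian alphabet places y between
į and j, so a Lithuanian reader could reasonably expect y to fall
inside "a to k". The helper name `in_a_to_k` is made up for this
example.

```python
import re

# A range in a character class is defined by code point order only.
A_TO_K = re.compile(r"[a-k]")
A_TO_K_HEX = re.compile(r"[\x61-\x6b]")  # the same range, spelled as code points


def in_a_to_k(ch: str) -> bool:
    """Return True if the single character ch matches [a-k]."""
    return A_TO_K.fullmatch(ch) is not None


print(in_a_to_k("j"))        # True:  U+006A lies inside U+0061..U+006B
print(in_a_to_k("y"))        # False: U+0079, although Lithuanian sorts y before j
print(in_a_to_k("\u012f"))   # False: į (i with ogonek) is U+012F
print(in_a_to_k("\u00e9"))   # False: é is U+00E9

# Both spellings accept exactly the same characters.
same = all(
    (A_TO_K.fullmatch(chr(cp)) is None) == (A_TO_K_HEX.fullmatch(chr(cp)) is None)
    for cp in range(0x250)
)
print(same)                  # True
```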

>However, given that different languages have different sorting orders,
>how would you expect to process a document that contains more than one
>language, for example, both English and Lithuanian? At least Unicode
>allows you to encode all the necessary characters of multiple languages.


Close to impossible, even using Unicode and even using UCA (the Unicode
Collation Algorithm). That needs a brain.

> > Would a more compact way be possible, for instance something like
> > \p{Cyrillic|Greek} for OR, \p{Devanagari+Nd} for AND and inner
> > parenthesis for more complex conditions like
> > \p{Lu+(Cyrillic|Greek)}?
>
>As far as I know, Perl doesn't have anything like this. Do any other
>regex engines?


I honestly have no idea, but probably not. I just thought it was the
missing stone on top of UCP, which is already in itself a formidable
step forward for users manipulating non-English text.
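
For what it is worth, the intended meaning of the three examples can
be sketched as plain predicate composition. This is only an
illustration of the semantics, not a proposed implementation and not
PCRE syntax. All helper names (`script`, `category`, `any_of`,
`all_of`) are hypothetical, and since Python's standard library does
not expose the Unicode Script property, the script test below is
approximated by the prefix of the character name, which is an
assumption that does not hold for every character.

```python
import unicodedata


def script(name):
    """Approximate a script test via the character-name prefix (assumption)."""
    prefix = name.upper() + " "
    return lambda ch: unicodedata.name(ch, "").startswith(prefix)


def category(cat):
    """Test the Unicode general category, e.g. 'Lu' or 'Nd'."""
    return lambda ch: unicodedata.category(ch) == cat


def any_of(*preds):
    """OR, written '|' in the proposed syntax."""
    return lambda ch: any(p(ch) for p in preds)


def all_of(*preds):
    """AND, written '+' in the proposed syntax."""
    return lambda ch: all(p(ch) for p in preds)


# \p{Cyrillic|Greek}
cyrillic_or_greek = any_of(script("Cyrillic"), script("Greek"))
# \p{Devanagari+Nd}
devanagari_digit = all_of(script("Devanagari"), category("Nd"))
# \p{Lu+(Cyrillic|Greek)}
upper_cyrillic_or_greek = all_of(category("Lu"), cyrillic_or_greek)

print(upper_cyrillic_or_greek("\u0414"))  # True:  CYRILLIC CAPITAL LETTER DE
print(upper_cyrillic_or_greek("\u0434"))  # False: CYRILLIC SMALL LETTER DE
print(upper_cyrillic_or_greek("A"))       # False: LATIN CAPITAL LETTER A
print(devanagari_digit("\u0969"))         # True:  DEVANAGARI DIGIT THREE
print(devanagari_digit("3"))              # False: DIGIT THREE
```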


--
<mailto:jcd@q-e-d.org>jcd@???