Re: [pcre-dev] Proposal for a new API for PCRE

Author: jcd
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Proposal for a new API for PCRE

>The solution to this is of course Unicode, which does away with code
>pages. I get the feeling that gradually everything in the ASCII world
>will move over to Unicode, and code pages will be a thing of the past
>(and of EBCDIC :-). Locales are still useful for specifying how to
>format dates, currency, and so on, but not for character sets.


This is not true for everyone, because the Unicode groups overlooked
a number of issues and settled for subpar choices. As if they were
short on free code points!

A number of language-specific exceptions remain unhandled, and they are
a real pain to deal with.

The Turkish dotted vs. dotless i is an easy example. I seem to recall
that there was a thread on this a few months ago, here or maybe on
another list. Anyway, whatever casing option is in effect, PCRE will
mix up dotted and dotless i (which are distinct letters in Turkish).
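
To make the Turkish case concrete, here is a minimal sketch in Python (chosen only for illustration; it does not call PCRE). It assumes the usual Turkish pairing of I with ı and of İ with i. The helpers turkish_lower and turkish_equal_caseless are hypothetical names, not part of any library.

```python
import re

DOTLESS_LOWER = "\u0131"  # ı, Turkish lowercase of I
DOTTED_UPPER = "\u0130"   # İ, Turkish uppercase of i


def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercase using the Turkish I/ı and İ/i pairs."""
    # Handle the two Turkish-specific letters first, then fall back to
    # the default (locale-independent) lowercasing for everything else.
    return text.replace("I", DOTLESS_LOWER).replace(DOTTED_UPPER, "i").lower()


def turkish_equal_caseless(a: str, b: str) -> bool:
    """Hypothetical helper: caseless comparison under the Turkish rules."""
    return turkish_lower(a) == turkish_lower(b)


# Default caseless matching pairs I with i ...
default_match = re.fullmatch("i", "I", re.IGNORECASE) is not None
# ... whereas under Turkish rules they are different letters.
turkish_match = turkish_equal_caseless("i", "I")
```

With the default rules "i" and "I" match caselessly; with the Turkish pairing they do not, while "ı" and "I" do.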

Another issue is with character class ranges. You would think that
[a-k] and [x-z] are well defined for Latin scripts, but that's not the
case. Lithuanian places y between i and j, so both ranges fail for
Lithuanian strings.
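
The range problem can be illustrated with a small Python sketch. It compares a code-point range with a range taken by position in the Lithuanian alphabet. The alphabet string and the helper in_alphabet_range are written out here for illustration only; nothing like this exists in PCRE.

```python
import re

# Lithuanian alphabet in dictionary order; note y right after i and į.
LITHUANIAN = "aąbcčdeęėfghiįyjklmnoprsštuųūvzž"


def in_alphabet_range(ch: str, first: str, last: str,
                      alphabet: str = LITHUANIAN) -> bool:
    """Hypothetical helper: is ch between first and last in alphabet order?"""
    if len(ch) != 1:
        return False
    pos = alphabet.find(ch)
    if pos < 0:
        return False  # not a letter of this alphabet
    return alphabet.index(first) <= pos <= alphabet.index(last)


# A regex range works on code points, so y is outside [a-k] ...
codepoint_says = re.fullmatch("[a-k]", "y") is not None
# ... but by Lithuanian alphabet order y lies between a and k.
alphabet_says = in_alphabet_range("y", "a", "k")
```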

Similarly, Finnish treats W as a variant of V which comes before V:
[a-v] should include w for Finnish strings. Swedish did the same
before the 2006 reform.

Several eastern Latin-script languages, like Hungarian, use
contractions (dz, cz, ...).

German uses two distinct sorting orders, and ä, ö, ü are sorted as
ae, oe, ue respectively in German phonebooks.
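
As a rough illustration of the phonebook order, the following Python sketch builds a sort key that expands the three umlauts. It covers only the ä, ö, ü expansion mentioned above and is not a complete collation; phonebook_key is a hypothetical helper name.

```python
# Expansion table for the three umlauts only (an assumption for this sketch;
# a real phonebook collation has more rules than this).
PHONEBOOK_EXPANSIONS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def phonebook_key(name: str) -> str:
    """Hypothetical helper: sort key with umlauts expanded to two letters."""
    return name.lower().translate(PHONEBOOK_EXPANSIONS)


names = ["Müller", "Mufti", "Mueller"]

by_codepoint = sorted(names)                    # ü sorts after every ASCII letter
by_phonebook = sorted(names, key=phonebook_key)  # Müller and Mueller sort together
```

By code point, Müller ends up after Mufti; with the expanded key it sorts together with Mueller, before Mufti.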

Welsh uses several contractions as well, which need special handling.

In Dutch ij can be a digraph or not, but in crosswords ij fills one
square and ij is considered and taught as the 25th letter of the alphabet.

So while Unicode is a huge step forward from code pages, it still
isn't without issues, even when considering Latin scripts only.


Of course PCRE is in no way responsible for this poor situation, and
there is nothing reasonable that can be done to fix these issues, but
we are still far from getting rid of locale considerations completely.


BTW I'd like to ask whether it would be possible to have a compact way
of ANDing and ORing together several category properties. NOT is
built-in with \P{}. OR can be done with alternation in look-ahead. AND
can also be done by successive look-aheads. But these constructs are
very cumbersome.

Would a more compact way be possible, for instance something like
\p{Cyrillic|Greek} for OR, \p{Devanagari+Nd} for AND, and inner
parentheses for more complex conditions like \p{Lu+(Cyrillic|Greek)}?
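
To show how cumbersome the current constructs are, here is a small Python sketch that only builds the pattern strings one has to write today for OR (alternation) and AND (successive look-aheads). It does not run them through PCRE, and the helpers prop, any_of and all_of are hypothetical names made up for this message.

```python
def prop(name: str, negate: bool = False) -> str:
    """Hypothetical helper: \\p{name}, or \\P{name} for NOT."""
    return ("\\P{%s}" if negate else "\\p{%s}") % name


def any_of(*names: str) -> str:
    """OR: one character that has at least one of the properties."""
    return "(?:" + "|".join(prop(n) for n in names) + ")"


def all_of(*names: str) -> str:
    """AND: look-aheads for all but the last property, which consumes.

    Needs at least one property name.
    """
    *guards, last = names
    return "".join("(?=%s)" % prop(n) for n in guards) + prop(last)


# What \p{Cyrillic|Greek} would stand for:
or_pattern = any_of("Cyrillic", "Greek")
# What \p{Devanagari+Nd} would stand for:
and_pattern = all_of("Devanagari", "Nd")
# What \p{Lu+(Cyrillic|Greek)} would stand for:
nested_pattern = "(?=%s)%s" % (prop("Lu"), any_of("Cyrillic", "Greek"))
```

Even for the nested case with only three properties, the expanded pattern is about twice as long as the proposed compact form.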

--
<mailto:jcd@q-e-d.org>jcd@???