Re: [pcre-dev] Proposal for a new API for PCRE

Author: ph10
Date:  
To: jcd
CC: pcre-dev
Subject: Re: [pcre-dev] Proposal for a new API for PCRE
On Fri, 30 Aug 2013, jcd wrote:

> There are a number of unhandled exceptions which are a real pain to deal with.


Indeed; I should have said that Unicode is not perfect, but it is (IMHO)
an improvement on code pages. I was giving an argument for not doing any
"code page" extensions to PCRE, rather than trying to say Unicode solves
all problems.

> The Turkish dotted vs. dotless i for instance is an easy example. I seem to
> recall that there was a thread on this a few months ago, or maybe on another
> list.


Yes, it was this list. The solution should be to create a new "Turkish
dotted i" character, distinct from the ASCII i, that can then have its
own behaviour. Whether this will ever happen is, of course, another
matter.
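
For anyone who has not met the problem, here is a small sketch of it. It
uses Python's built-in string methods rather than PCRE, purely as an
illustration of the default, language-neutral Unicode case mappings; the
helper name default_case_pairs is made up for this example. In Turkish the
case pairs are i/İ and ı/I, which the default mappings cannot express
because both pairs share the ASCII letters.

```python
# Illustration only: Python's str methods, not PCRE.
# They apply the default (language-neutral) Unicode case mappings.

DOTLESS_SMALL_I = "\u0131"   # ı  LATIN SMALL LETTER DOTLESS I
DOTTED_CAPITAL_I = "\u0130"  # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE


def default_case_pairs():
    """Return the default case mappings for the four 'i' letters (hypothetical helper)."""
    return {
        "upper(i)": "i".upper(),              # Turkish expects İ, default gives I
        "lower(I)": "I".lower(),              # Turkish expects ı, default gives i
        "upper(ı)": DOTLESS_SMALL_I.upper(),  # I in both conventions
        "lower(İ)": DOTTED_CAPITAL_I.lower(), # i followed by U+0307 COMBINING DOT ABOVE
    }


if __name__ == "__main__":
    for name, value in default_case_pairs().items():
        print(name, [hex(ord(c)) for c in value])
```

Because ASCII i and I each take part in two different pairings, a
caseless match cannot be right for both Turkish and non-Turkish text
unless the engine is told which language it is dealing with.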

> Another issue is with character class ranges. You would think that [a-k] and
> [x-z] are well defined for Latin scripts but that's not the case.


They are certainly well defined, being equivalent to [\x61-\x6b] and
[\x78-\x7a], that is, related to code order, not cultural sorting order.
I can see that this is confusing for users of Lithuanian and other
languages where sorting order is different to code order.
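
A minimal sketch of the code-order point, using Python's re module as a
stand-in (an assumption made for illustration; PCRE is not involved, but
ranges are resolved by code point in the same way). The Lithuanian
alphabet runs ... i į y j k ..., so a Lithuanian reader might expect
[a-k] to include y, į and ą. The names below are invented for the example.

```python
import re

# Illustration only: Python's re, not PCRE.
A_TO_K = re.compile(r"[a-k]")
A_TO_K_HEX = re.compile(r"[\x61-\x6b]")  # the same range written as code points


def in_range(ch):
    """True if the single character ch is matched by [a-k]."""
    return A_TO_K.fullmatch(ch) is not None


if __name__ == "__main__":
    # y, į and ą sort between a and k in Lithuanian, but their code
    # points (U+0079, U+012F, U+0105) lie outside U+0061..U+006B.
    for ch in ["a", "j", "k", "y", "\u012f", "\u0105"]:
        print(f"U+{ord(ch):04X} {ch!r}: {in_range(ch)}")
```

The class is decided purely by numeric code point, so the result is the
same whatever language the surrounding text happens to be in.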

However, given that different languages have different sorting orders,
how would you expect to process a document that contains more than one
language, for example, both English and Lithuanian? At least Unicode
allows you to encode all the necessary characters of multiple languages.

> BTW I'd like to ask whether it would be possible to have a compact way of
> ANDing and ORing together several category properties. NOT is built-in with
> \P{}. OR can be done with alternation in look-ahead. AND can also be done by
> successive look-aheads. But these constructs are very cumbersome.
>
> Would a more compact way be possible, for instance something like
> \p{Cyrillic|Greek} for OR, \p{Devanagari+Nd} for AND and inner parentheses for
> more complex conditions like \p{Lu+(Cyrillic|Greek)}?


As far as I know, Perl doesn't have anything like this. Do any other
regex engines? I have made a note of your suggestion, but writing a
parser for such expressions isn't just a few lines of code, so it
definitely won't happen soon.
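
For reference, the existing workarounds mentioned above (alternation for
OR, successive look-aheads for AND) look roughly like the sketch below.
It uses Python's re module, which has no \p{...} support, so Unicode
block ranges stand in for the script properties and \d stands in for Nd.
Both substitutions are assumptions made for this illustration and are
only approximations of the real properties; the variable names are
invented.

```python
import re

# Assumption: block ranges approximate the script properties
# \p{Devanagari}, \p{Cyrillic} and \p{Greek}. They are not exact.
DEVANAGARI = "\u0900-\u097F"
CYRILLIC = "\u0400-\u04FF"
GREEK = "\u0370-\u03FF"

# AND: each look-ahead tests one condition on the next character,
# then "." consumes it. Roughly the proposed \p{Devanagari+Nd}.
devanagari_digit = re.compile(rf"(?=[{DEVANAGARI}])(?=\d).")

# OR: plain alternation. Roughly the proposed \p{Cyrillic|Greek}.
cyrillic_or_greek = re.compile(rf"(?:[{CYRILLIC}]|[{GREEK}])")

if __name__ == "__main__":
    # U+0967 is a Devanagari digit, U+0915 a Devanagari letter.
    for ch in ["\u0967", "\u0915", "1"]:
        print(f"U+{ord(ch):04X}", bool(devanagari_digit.fullmatch(ch)))
    # U+0416 is Cyrillic, U+03B1 is Greek, "a" is neither.
    for ch in ["\u0416", "\u03b1", "a"]:
        print(f"U+{ord(ch):04X}", bool(cyrillic_or_greek.fullmatch(ch)))
```

Each extra AND condition costs another look-ahead group, which is the
verbosity the proposed compact syntax is meant to remove.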

Thanks for your comments.

Philip

--
Philip Hazel