Re: [pcre-dev] Ignoring a whole set of unicode characters

Author: Ze'ev Atlas
Date:  
To: pcre-dev@exim.org
Subject: Re: [pcre-dev] Ignoring a whole set of unicode characters
Hi Philip,

I consider that to be a major issue for non-European languages, at least those that distinguish between the basic consonant letters and additional characters that may add information but are not necessary. (I gave you two scripts in which such characters would be implied whether they are there or not.)

I know that you abide by the Perl 'standard', so I guess my question is: where do I go to propose such a change? Let's say something like the following (similar to /i for ignore case in Perl), where I use D for 'Disregard':

    if ($x =~ /some elaborate pattern with Unicode characters/D{UNICODE CLASS NAME})  # or D{list of code points}

which would mean: match the pattern /some elaborate pattern with Unicode characters/, but whenever you see a disregarded character, behave as if it were not there at all!

Another possibility would be to pass that list to the engine in advance via some general parameter. What I would not want is to have it as a compile-time option when the C code of the engine is built.

How hard would it be to implement something like this?
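
To make the intended semantics concrete, here is a rough Python sketch of what the proposed "disregard" behaviour could look like when emulated outside the engine. It is only an illustration under my own assumptions: the helper names strip_disregarded and search_disregarding are hypothetical, and the disregarded set is taken to be the Unicode general category Mn (nonspacing marks). It removes the disregarded characters from the subject before matching, so any offsets in the result refer to the stripped string, not the original.

```python
import re
import unicodedata


def strip_disregarded(text, categories=("Mn",)):
    """Drop every character whose Unicode general category is disregarded.

    The default ("Mn", nonspacing marks) is an assumption for this sketch.
    """
    return "".join(
        ch for ch in text if unicodedata.category(ch) not in categories
    )


def search_disregarding(pattern, subject, categories=("Mn",)):
    """Search `pattern` in `subject` as if disregarded characters were absent.

    Hypothetical helper: match offsets refer to the stripped subject.
    """
    return re.search(pattern, strip_disregarded(subject, categories))


# Alef + qamats (a vowel point) + bet, searched with the bare consonants.
pointed = "\u05d0\u05b8\u05d1"
match = search_disregarding("\u05d0\u05d1", pointed)
```

A built-in option would presumably avoid the extra copy and keep offsets into the original subject, which this workaround cannot do.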
 Ze'ev Atlas


      From: "ph10@???" <ph10@???>



I am not expert on this kind of thing, but doesn't \X do some of what
you want? It will, for example, match the pair 05d0,05b8. It matches
what Unicode calls an "Extended Grapheme Cluster". See the description
in the pcre[2]pattern page for details.

If you wanted to match 05d0 plus any following mark characters you could
write this:  (?=\x{5d0})\X  which is clumsy, but I don't know of
anything neater.
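
For readers without an engine that supports \X (Python's standard re module, as far as I know, does not), the idea behind (?=\x{5d0})\X can be approximated as "the base character plus any mark characters that follow it". The sketch below is an assumption-laden simplification: the function name match_base_plus_marks is hypothetical, and it treats every character in a Unicode M* category as a mark, which is narrower than a full Extended Grapheme Cluster.

```python
import unicodedata


def match_base_plus_marks(subject, base="\u05d0", start=0):
    """Return `base` plus any following mark characters at `start`, or None.

    Approximates (?=\\x{5d0})\\X by treating general categories Mn, Mc and Me
    as marks; real grapheme cluster rules cover more cases.
    """
    if not subject.startswith(base, start):
        return None
    end = start + len(base)
    # Extend the match over the run of combining marks after the base.
    while end < len(subject) and unicodedata.category(subject[end]).startswith("M"):
        end += 1
    return subject[start:end]


# Alef followed by qamats, then bet: only the alef and its mark are taken.
cluster = match_base_plus_marks("\u05d0\u05b8\u05d1")
```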

Philip

--
Philip Hazel