Re: [pcre-dev] Ignoring a whole set of unicode characters

Author: ph10
Date:  
To: Ze'ev Atlas
CC: Pcre Exim
Subject: Re: [pcre-dev] Ignoring a whole set of unicode characters
On Thu, 26 Mar 2015, Ze'ev Atlas wrote:

> I would like to ask whether it is possible to implement in PCRE a way
> to tell the regex to ignore a whole set of Unicode characters.  Let me
> explain: in some languages, Arabic and
> Hebrew are main examples, there are unicode characters that are
> considered to be vowels.  They do not exist on their own, do not take
> a space in print but could be applied to the base character.  In most
> cases, those vowels are dropped and not used at all since the readers
> know from the context what they would be, but in other cases they are
> applied.  As opposed to diacritics, they are NEVER considered a part
> of the base character.  Example: code point 05d0 is HEBREW LETTER ALEF,
> but the two codepoints 05d0,05B8 next to each other are HEBREW LETTER
> ALEF with HEBREW POINT QAMATS.  Now, in many cases, all we need to
> search are the base characters without taking care of all possible
> such combinations.  What I need is a way to tell the regex engine
> something like: "when you see any of the characters between code point
> 05a0 and 05c7 (that represent the classes 'Hebrew Cantillation marks',
> 'Points and punctuation' and 'Puncta extraordinaria') ignore them as
> if they were not there at all".  Similar functionality could apply to
> Arabic code points 066a through 066d (Punctuation); there are more
> groups like this.
> Ze'ev Atlas


I am not expert on this kind of thing, but doesn't \X do some of what
you want? It will, for example, match the pair 05d0,05b8. It matches
what Unicode calls an "Extended Grapheme Cluster". See the description
in the pcre[2]pattern page for details.

If you wanted to match 05d0 plus any following mark characters you could
write this: (?=\x{5d0})\X which is clumsy, but I don't know of
anything neater.
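
For illustration only, here is a minimal sketch of the same idea outside
PCRE, using Python's standard "re" module (which has no \X). It assumes
the mark range 05a0-05c7 quoted in the question; the helper names
strip_marks and mark_tolerant are hypothetical, not part of any library.
It shows two workarounds: removing the marks from the subject before
matching, and allowing optional marks after each base character in the
pattern.

```python
import re

# Assumption: the range of characters to ignore is the one quoted in the
# question (U+05A0..U+05C7); adjust it for other scripts.
HEBREW_MARKS = "\u05a0-\u05c7"


def strip_marks(text):
    """Remove every code point in the mark range, leaving base characters."""
    return re.sub(f"[{HEBREW_MARKS}]", "", text)


def mark_tolerant(literal):
    """Build a pattern matching `literal` with optional marks after each character."""
    return "".join(re.escape(ch) + f"[{HEBREW_MARKS}]*" for ch in literal)


# ALEF + QAMATS followed by BET
pointed = "\u05d0\u05b8\u05d1"

# Workaround 1: normalise the subject, then search for the bare letters.
bare = strip_marks(pointed)

# Workaround 2: keep the subject as-is and make the pattern tolerate marks.
hit = re.search(mark_tolerant("\u05d0\u05d1"), pointed)

# Rough analogue of (?=\x{5d0})\X limited to the assumed mark range:
# ALEF plus any marks that directly follow it.
alef_cluster = re.match(f"\u05d0[{HEBREW_MARKS}]*", pointed).group()
```

The first workaround loses the original offsets into the subject; the
second keeps them at the cost of a longer pattern. Neither is a full
substitute for \X, which follows the Unicode grapheme cluster rules
rather than a fixed code point range.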

Philip

--
Philip Hazel