[pcre-dev] Ignoring a whole set of unicode characters

Author: Ze'ev Atlas
Date:  
To: Philip Hazel, Pcre Exim
Subject: [pcre-dev] Ignoring a whole set of unicode characters
I would like to ask whether it is possible, or whether it could be implemented in PCRE, to tell the regex engine to ignore a whole set of Unicode characters.

Let me explain. In some languages (Arabic and Hebrew are the main examples) there are Unicode characters that are considered vowels. They do not exist on their own and do not take up any space in print, but are applied to a base character. In most cases these vowels are dropped and not used at all, since readers know from context what they would be, but in other cases they are written. As opposed to diacritics, they are NEVER considered a part of the base character.

Example: code point 05d0 is HEBREW LETTER ALEF, but the two code points 05d0, 05b8 next to each other are HEBREW LETTER ALEF with HEBREW POINT QAMATS.

In many cases, all we need to search for are the base characters, without having to account for every possible combination. What I need is a way to tell the regex engine something like: "when you see any of the characters between code points 05a0 and 05c7 (which cover the classes 'Hebrew Cantillation marks', 'Points and punctuation' and 'Puncta extraordinaria'), ignore them as if they were not there at all". Similar functionality could apply to the Arabic code points 066a through 066d (Punctuation); there are more groups like this.

Ze'ev Atlas
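To make the request concrete, here is a minimal sketch of the behaviour being asked for, emulated outside the engine. It uses Python's `re` module as a stand-in, not PCRE itself, and it is not an existing PCRE option. The helper names `strip_ignored` and `tolerant_pattern` are hypothetical, and the range 05a0-05c7 is simply the one quoted in the message. Two approaches are shown: removing the ignorable characters from the subject before matching, and allowing any run of them after each character of a literal search term.

```python
import re

# Range quoted in the message: U+05A0 through U+05C7 (an assumption taken
# from the text, not a complete list of Hebrew marks).
IGNORED = "\u05a0-\u05c7"
IGNORED_RE = re.compile(f"[{IGNORED}]")


def strip_ignored(text: str) -> str:
    """Return text with every character in the ignored range removed."""
    return IGNORED_RE.sub("", text)


def tolerant_pattern(literal: str) -> re.Pattern:
    """Compile a pattern matching `literal` with optional ignored marks
    allowed after each of its characters."""
    filler = f"[{IGNORED}]*"
    # Drop marks from the search term itself, then let any run of marks
    # follow each remaining character in the subject.
    base = strip_ignored(literal)
    return re.compile("".join(re.escape(ch) + filler for ch in base))


if __name__ == "__main__":
    pointed = "\u05d0\u05b8\u05d1"      # ALEF, QAMATS, BET
    plain = "\u05d0\u05d1"              # ALEF, BET
    print(strip_ignored(pointed) == plain)
    print(tolerant_pattern(plain).search(pointed) is not None)
```

The first approach loses the original offsets in the subject string; the second keeps them, at the cost of a longer pattern and of only handling literal search terms.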