Re: [pcre-dev] Ignoring a whole set of unicode characters

Author: ph10
Date:  
To: Ze'ev Atlas
CC: pcre-dev@exim.org
Subject: Re: [pcre-dev] Ignoring a whole set of unicode characters
On Thu, 26 Mar 2015, Ze'ev Atlas wrote:

> I consider that to be a major issue for non-European
> languages, at least those that distinguish between the basic consonant
> letters and additional characters that may add information but are not
> necessary (I gave you two scripts in which such characters would be
> implied whether they are there or not).
>
> I know that you abide by the Perl 'standard', so I guess my question
> would be where do I go to propose such a change, let's say something
> like (similar to /i for ignore case in Perl):


There are Perl mailing lists, but I can't quote them to you offhand.

> I use D for 'Disregard':
>
>   if ($x =~ /some elaborate pattern with Unicode characters/D{UNICODE CLASS NAME})  # or D{list of code points}
>
> which will mean: match the pattern /some elaborate pattern with Unicode
> characters/, but whenever you see a disregarded character behave like
> it was not there at all!


> Another possibility would be to tell that list to the engine in
> advance via some general parameter.  What I won't want is to have that
> as a compile option to the C compiler when it builds the engine.
>
> How hard would it be to implement something like this?


This list is presumably all the Unicode "mark" characters, which PCRE
already knows about.

I'm not sure how hard this would be[*], but I suspect that it would slow
down matching because it would have to do this checking quite often. For
any application, I think it would be faster to make a copy of the
subject, deleting all the unwanted characters, before starting the
match. This would have the benefit of not repeatedly doing the checking
when backtracking occurs during matching.
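
A minimal sketch of that preprocessing approach, in Python rather than C for brevity. It assumes the "unwanted" characters are exactly those whose Unicode general category starts with "M" (the mark characters); the function name strip_marks and the Hebrew sample text are illustrative only and are not part of PCRE.

```python
import re
import unicodedata


def strip_marks(text):
    """Return a copy of text with all Unicode mark characters removed.

    Assumption: "disregarded" means general category Mn, Mc or Me.
    Precomposed letters such as U+00E9 are left untouched, because
    no normalization is applied here.
    """
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("M")
    )


# Hebrew word written with vowel points (marks) between the consonants.
subject = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"

# The pattern contains the consonants only.
pattern = re.compile("\u05e9\u05dc\u05d5\u05dd")

# The marks are removed once, up front, so the matcher never sees them,
# however much backtracking it does.
cleaned = strip_marks(subject)
match = pattern.search(cleaned)
```

One consequence of this design is that any offsets reported by the match refer to the cleaned copy, not to the original subject; an application that needs original positions would have to keep a mapping between the two.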

[*] It might be quite easy, because all non-ASCII characters are
"loaded" by a macro call. However, I would not want there to be a test
for "special mode" inside this call, because of the performance
implications.

Final thought: It might be easier in the DFA matching function, because
that moves along the subject character by character, without
backtracking.
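
To illustrate why a forward-only scan makes skipping cheap, here is a toy sketch in Python. It is not PCRE's DFA code: it handles only a literal needle, assumes the needle itself contains no mark characters, and the function name contains_ignoring_marks is made up for this example. Each subject character is examined exactly once, so the "is this a mark?" test is never repeated for the same character.

```python
import unicodedata


def contains_ignoring_marks(subject, needle):
    """Return True if needle occurs in subject when marks are ignored.

    The set `active` holds every needle offset reached by some
    in-progress match attempt, so the scan never moves backwards.
    """
    if not needle:
        return True
    active = set()
    for ch in subject:
        if unicodedata.category(ch).startswith("M"):
            # Disregarded character: leave all attempts as they are.
            continue
        next_active = set()
        # Offset 0 is always a candidate: a new attempt may start here.
        for pos in active | {0}:
            if needle[pos] == ch:
                next_active.add(pos + 1)
        if len(needle) in next_active:
            return True
        active = next_active
    return False


# Hebrew word with vowel points, searched for by its consonants only.
pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
found = contains_ignoring_marks(pointed, "\u05e9\u05dc\u05d5\u05dd")
```

A real implementation inside a regex engine would have to deal with much more (character classes, repeats, captures), so this shows only the shape of the idea.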

I have put this on the PCRE2 wish list, but do not hold your breath.

Philip

--
Philip Hazel