Re: [pcre-dev] Getting PCRE to accept *any* valid letter

Author: Philip Hazel
Date:
To: Patrick Questembert
CC: pcre-dev
Subject: Re: [pcre-dev] Getting PCRE to accept *any* valid letter

On Tue, 3 May 2011, Patrick Questembert wrote:

> I am using the PCRE library as part of the iPhone & Android ScanBizCards
> business card reader - many thanks for a vital product!

It's always nice to hear about satisfied "customers"!

> We support 22 scanning languages and so we need PCRE to accept letters in
> all these languages (French, German, Turkish etc). The primary needs are
> for:
> - \w to correctly treat letters as such
> - :i to properly recognize the uppercase / lowercase correspondence of
> letters

Which release of PCRE? The ChangeLog for 8.10 has this entry:

    9.  Added PCRE_UCP to make \b, \d, \s, \w, and certain POSIX character
        classes use Unicode properties. (*UCP) at the start of a pattern can
        be used to set this option. Modified pcretest to add /W to test this
        facility. Added REG_UCP to make it available via the POSIX interface.

> Neither are working in the default configuration and I started digging into
> the issues - only to find rather complex manners of building "tables" for
> what constitutes characters in various locales.

Locales are sooo 20th century... :-) Seriously, I think the world is
rapidly moving away from locales and using Unicode instead. Then you can
have French, German, Turkish, etc. characters *in the same string*.

PCRE supports Unicode encoded as UTF-8 and, as of release 8.10, lets
you make \w do what you want in UTF-8 mode by setting PCRE_UCP (see the
ChangeLog entry above). PCRE_CASELESS has handled characters beyond
ASCII in UTF-8 mode for even longer. Check out the PCRE_UTF8 option
flag.

> I have simple functions elsewhere in my code (isLetter, isLower, isUpper,
> toLower, toUpper) doing precisely that - if PCRE doesn't have a generic
> mode per the above could you advise in which source file(s) I would need to
> insert my functions to obtain the desired generic character acceptance?

isLetter() etc. take integers as arguments, and so work fine with
character values greater than 255.
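
To show where character values greater than 255 come from, here is a
simplified sketch of decoding one UTF-8 sequence into an integer code
point, which is the kind of value an integer-taking isLetter() would
then receive. The function name utf8_decode is hypothetical, and the
sketch deliberately skips checks for overlong encodings and surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the UTF-8 sequence at s (at most len
 * bytes available). On success, store the code point in *cp and return
 * the number of bytes consumed. Return 0 on truncated or malformed
 * input. Simplified: overlong forms and surrogates are not rejected. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t n, i;
    uint32_t c;

    if (len == 0)
        return 0;

    if (s[0] < 0x80) {                    /* 1 byte: plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 2-byte sequence */
        n = 2; c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 3-byte sequence */
        n = 3; c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 4-byte sequence */
        n = 4; c = s[0] & 0x07;
    } else {
        return 0;                         /* stray continuation or invalid lead byte */
    }

    if (len < n)
        return 0;                         /* truncated sequence */

    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                     /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }

    *cp = c;
    return n;
}
```

For instance, the two bytes 0xC4 0xB0 decode to U+0130 (Latin capital
letter I with dot above, used in Turkish), a value that no 256-entry
table can index.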

Now, if you are indeed working with locales instead of Unicode, that is,
restricting yourself to only 256 character code points, I'm afraid
you'll have to tangle with the creation of appropriate tables for each
locale, because the same code value (in the range 128-255) can be
different things in different locales. If you try to do it by inserting
code into PCRE, how do you define which characters are letters? You'll
have to have your own tables, so why not use PCRE's existing code? Then
you won't have to upgrade for each PCRE release.
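
As an illustration of why a single table cannot serve every locale,
here is a small hand-written sketch comparing two single-byte
character sets. The enum and function names are hypothetical, and the
ranges are deliberately partial (accented Greek letters and a few
Latin-1 letters below 0xC0 are left out); it is not how PCRE builds
its tables.

```c
#include <stdbool.h>

/* Two single-byte character sets that disagree about bytes 128-255. */
enum charset { CS_LATIN1 /* ISO-8859-1 */, CS_GREEK /* ISO-8859-7 */ };

/* Hypothetical, partial letter test: is this byte a letter when
 * interpreted in the given character set? */
bool is_letter_in(enum charset cs, unsigned char byte)
{
    if (byte < 0x80)
        return (byte >= 'A' && byte <= 'Z') || (byte >= 'a' && byte <= 'z');

    switch (cs) {
    case CS_LATIN1:
        /* 0xC0-0xFF are letters, except 0xD7 (multiplication sign)
         * and 0xF7 (division sign). */
        return byte >= 0xC0 && byte != 0xD7 && byte != 0xF7;
    case CS_GREEK:
        /* Unaccented capitals Alpha-Omega (0xD2 is unassigned) and
         * unaccented small letters alpha-omega. */
        return (byte >= 0xC1 && byte <= 0xD1) ||
               (byte >= 0xD3 && byte <= 0xD9) ||
               (byte >= 0xE1 && byte <= 0xF9);
    }
    return false;
}
```

The byte 0xD7 is the multiplication sign in ISO-8859-1 but capital Chi
in ISO-8859-7, so is_letter_in() gives different answers for the same
code value depending on the character set.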

PCRE comes with all the necessary apparatus for building tables in the
current locale. Check out the pcre_maketables.c and dftables.c source
files. However, I would seriously advise you to look at Unicode if you
are not already using it.

I hope this helps.

Philip

--
Philip Hazel
