Re: [pcre-dev] Getting PCRE to accept *any* valid letter

Author: Patrick Questembert
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Getting PCRE to accept *any* valid letter
Hi Philip,

Delighted to hear back from Mr PCRE himself so quickly! We are using PCRE
7.8 and will update to 8.10 promptly.

Indeed we don't use code pages and I wonder why mankind got started with
pages of length 256 in the first place! Didn't China, France, Germany,
Russia and Greece exist before computers were invented, raising the
possibility that - God forbid - letters from several languages could appear
in the same document? We use PCRE compiled with UTF8 support and pass
UTF8-encoded Unicode characters (returned by Tesseract OCR).

Quick questions:

- so we need to add "(*UCP)" verbatim to the start of each pattern where we
want \w to properly accept any letter in any language? Just wondering why
anyone would NOT want \w to encompass all letters ... perhaps scientific
papers where Greek letters would be symbols in a formula rather than
letters? Even so, it would seem that (*UCP) should be the default, no?

- you seem to be saying that all letters were properly recognized caseless
a long time ago? That's not what I am seeing with V7.8. For example:
"koordinatör" (a Turkish job title ....) matches (?i:koordinatör) but
"koordinatÖr" does not. Is it surprising? Does it not stand to reason that
this would fail prior to V8.10 considering that until then ö/Ö were not
treated as letters and thus nobody bothered to lowercase/uppercase them?
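
A rough illustration of the two behaviours asked about above, written with
Python's built-in re module rather than PCRE itself (an assumption made only
so the sketch runs without libpcre). Python's re.ASCII flag stands in for
PCRE without (*UCP); the analogy is loose, because under re.ASCII Python
also limits case folding to ASCII, which is not necessarily what PCRE 7.8
does.

```python
import re

# Test words, with the non-ASCII letters written as escapes so the
# example does not depend on how this message was encoded.
lower_word = "koordinat\u00f6r"   # ends in "ör"
upper_word = "koordinat\u00d6r"   # ends in "Ör"

# \w restricted to [A-Za-z0-9_] (stand-in for PCRE without (*UCP)).
ascii_word = re.compile(r"\w+", re.ASCII)
# \w using Unicode properties (stand-in for PCRE with (*UCP)).
unicode_word = re.compile(r"\w+")

ascii_accepts = ascii_word.fullmatch(lower_word) is not None
unicode_accepts = unicode_word.fullmatch(lower_word) is not None

# Caseless matching with and without Unicode case folding.
unicode_caseless = re.compile(lower_word, re.IGNORECASE)
ascii_caseless = re.compile(lower_word, re.IGNORECASE | re.ASCII)

unicode_caseless_hit = unicode_caseless.fullmatch(upper_word) is not None
ascii_caseless_hit = ascii_caseless.fullmatch(upper_word) is not None

print(ascii_accepts, unicode_accepts)
print(unicode_caseless_hit, ascii_caseless_hit)
```

Running the same four checks against the PCRE build in question would show
whether \w and caseless matching fail separately or together.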

May I forward some promo codes for you to play with our Android / iPhone
business card reader? It may after all be the mobile app with the longest
regular expressions (some of which I can't understand even though I wrote
them :-))?

PS: I wanted to also make a couple of requests for the future ...
1. It would be amazingly useful if we could insert "replace" directives
within regexes, e.g. match "(Great Britain|United Kingdom|England)" and if
so, replace with "UK". The documented way to do that is to use capturing
groups and then replace as necessary (and we do that extensively) but it is
often much more complicated than doing it "in place" within the pattern.
2. It would be great to have state variables, e.g. if a specific capturing
group (say detecting "www") was matched with characters before the current
position, be able to set and test a state variable (say GOT_URL). Apologies
if this is an existing part of the standard which I missed ...
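
For request 1, one possible workaround outside the pattern language is to
keep the replacement next to its alternatives in a table and apply it with a
callback. This is a minimal sketch in Python's re module, not a PCRE
feature; the names VARIANTS, PATTERN and normalize are hypothetical and
exist only for this example.

```python
import re

# Hypothetical table: each spelling that may appear on a card, mapped to
# the canonical form it should be replaced with.
VARIANTS = {
    "Great Britain": "UK",
    "United Kingdom": "UK",
    "England": "UK",
}

# One alternation built from the table. Longer spellings come first so a
# shorter alternative can never shadow a longer one.
PATTERN = re.compile(
    "|".join(re.escape(v) for v in sorted(VARIANTS, key=len, reverse=True))
)

def normalize(text):
    """Replace every known spelling with its canonical form."""
    return PATTERN.sub(lambda m: VARIANTS[m.group(0)], text)

result = normalize("Acme Ltd, London, United Kingdom")
print(result)
```

The pattern and the replacement still live in one place, which is most of
what an in-pattern "replace" directive would buy.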

Patrick

On Tue, May 3, 2011 at 2:57 PM, Philip Hazel <ph10@???> wrote:

> On Tue, 3 May 2011, Patrick Questembert wrote:
>
> > I am using the PCRE library as part of the iPhone & Android ScanBizCards
> > business card reader - many thanks for a vital product!
>
> It's always nice to hear about satisfied "customers"!
>
> > We support 22 scanning languages and so we need PCRE to accept letters in
> > all these languages (French, German, Turkish etc). The primary needs are
> > for:
> > - \w to correctly treat letters as such
> > - :i to properly recognize the uppercase / lowercase correspondence of
> > letters
>
> Which release of PCRE? The ChangeLog for 8.10 has this entry:
>
> 9. Added PCRE_UCP to make \b, \d, \s, \w, and certain POSIX character
> classes use Unicode properties. (*UCP) at the start of a pattern can
> be used to set this option. Modified pcretest to add /W to test this
> facility. Added REG_UCP to make it available via the POSIX interface.
>
> > Neither are working in the default configuration and I started digging into
> > the issues - only to find rather complex manners of building "tables" for
> > what constitutes characters in various locales.
>
> Locales are sooo 20th century... :-) Seriously, I think the world is
> rapidly moving away from locales and using Unicode instead. Then you can
> have French, German, Turkish, etc. characters *in the same string*.
>
> PCRE supports Unicode encoded as UTF-8 and, as of release 8.10, does
> allow you to make \w do what you want in UTF-8 mode. As for making
> PCRE_CASELESS behave similarly, that was introduced even longer ago for
> UTF-8 mode. Check out the PCRE_UTF8 option flag.
>
> > I have simple functions elsewhere in my code (isLetter, isLower, isUpper,
> > toLower, toUpper) doing precisely that - if PCRE doesn't have a generic
> > mode per the above could you advise in which source file(s) I would need to
> > insert my functions to obtain the desired generic character acceptance?
>
> isLetter() etc. take integers as arguments, and so work fine with
> character values greater than 255.
>
> Now, if you are indeed working with locales instead of Unicode, that is,
> restricting yourself to only 256 character code points, I'm afraid
> you'll have to tangle with the creation of appropriate tables for each
> locale, because the same code value (in the range 128-255) can be
> different things in different locales. If you try to do it by inserting
> code into PCRE, how do you define which characters are letters? You'll
> have to have your own tables, so why not use PCRE's existing code? Then
> you won't have to upgrade for each PCRE release.
>
> PCRE comes with all the necessary apparatus for building tables in the
> current locale. Check out the maketables.c and dftables.c source files.
> However, I would seriously advise you to look at Unicode if you are not
> already using it.
>
> I hope this helps.
>
> Philip
>
> --
> Philip Hazel
>
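
The point in the quoted reply about code values in the range 128-255
meaning different things in different locales can be seen in a small Python
sketch. The three encodings are picked arbitrarily for illustration and are
not taken from the thread.

```python
# A single byte value in the 128-255 range.
raw = bytes([0xF6])

# The same byte decoded under three different 8-bit character sets.
as_latin1 = raw.decode("latin-1")      # Western European
as_cp1251 = raw.decode("cp1251")       # Cyrillic
as_greek = raw.decode("iso8859_7")     # Greek

# Each decoding yields a different Unicode code point.
code_points = [hex(ord(c)) for c in (as_latin1, as_cp1251, as_greek)]
print(code_points)

# In UTF-8 the three letters have distinct encodings, so they can sit
# side by side in one string without any locale table.
mixed = as_latin1 + as_cp1251 + as_greek
utf8_bytes = mixed.encode("utf-8")
print(len(mixed), len(utf8_bytes))
```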