Re: [pcre-dev] Getting PCRE to accept *any* valid letter

Author: Philip Hazel
Date:
To: Patrick Questembert
CC: pcre-dev
Subject: Re: [pcre-dev] Getting PCRE to accept *any* valid letter

On Tue, 3 May 2011, Patrick Questembert wrote:

> Delighted to hear back from Mr PCRE himself so quickly! We are using PCRE
> 7.8 and will update to 8.10 promptly.

Please go to 8.12, the current release. (I am working on 8.13 at the
moment.)

> - so we need to add "(*UCP)" verbatim to the start of each pattern where we
> want \w to properly accept any letter in any language?

Or set the PCRE_UCP option when you call pcre_compile().

> Just wondering why anyone would NOT want \w to encompass all letters

It's much faster if you don't, and if you happen to know you are
processing only English text, albeit encoded as UTF-8, that may be what
you want. But the real reason is historical: when the facility was
introduced, I did not make it the default, for backwards compatibility.
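
As a rough illustration of the difference between an ASCII-only \w and a
Unicode-aware one, here is a minimal sketch. It uses Python's built-in re
module as a stand-in, not the PCRE API: re.ASCII is assumed to behave
roughly like PCRE without PCRE_UCP, and Python's default behaviour for
str patterns roughly like PCRE with PCRE_UCP. The variable names are
made up for the example.

```python
import re

# Python's re module stands in for PCRE here; this is NOT the PCRE API.
# re.ASCII limits \w to [a-zA-Z0-9_] (similar in spirit to PCRE without UCP).
ascii_word = re.compile(r"\w+", re.ASCII)
# Without re.ASCII, \w on str patterns follows Unicode letter properties
# (similar in spirit to PCRE with PCRE_UCP or a leading (*UCP)).
unicode_word = re.compile(r"\w+")

subject = "koordinat\u00f6r"  # "koordinatör", with a precomposed o-umlaut

print(ascii_word.match(subject).group())    # koordinat
print(unicode_word.match(subject).group())  # koordinatör
```

The ASCII-only pattern stops at the first non-ASCII letter, which is the
behaviour that prompted the original question.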

> - you seem to be saying that all letters were properly recognized caseless
> a long time ago? That's not what I am seeing with V7.8. For example:
> "koordinatör" (a Turkish job title ....) matches (?i:koordinatör) but
> "koordinatÖr" does not.

I assume you have compiled PCRE with UCP support as well as UTF-8? It
certainly works in PCRE 8.12:

PCRE version 8.12 2011-01-15

/koordinatör/i8
    koordinatör
 0: koordinat\x{f6}r
    koordinatÖr
 0: koordinat\x{d6}r

There are comments in the ChangeLog for 7.8 and other versions about
bugs fixed for caseless UTF-8 matching, so that feature must have been
introduced before then.
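
The same caseless check can be sketched outside pcretest. The example
below again uses Python's re module as an analogy only, not PCRE: the
assumption is that re.IGNORECASE without re.ASCII plays the role of a
build with Unicode case information, and re.IGNORECASE with re.ASCII
the role of a build without it.

```python
import re

pattern_text = "koordinat\u00f6r"  # "koordinatör"
subject = "koordinat\u00d6r"       # "koordinatÖr"

# Caseless matching with Unicode case information available.
unicode_caseless = re.compile(pattern_text, re.IGNORECASE)
# Caseless matching restricted to ASCII letters only.
ascii_caseless = re.compile(pattern_text, re.IGNORECASE | re.ASCII)

print(bool(unicode_caseless.fullmatch(subject)))  # True
print(bool(ascii_caseless.fullmatch(subject)))    # False
```

If a real PCRE build behaves like the second case, missing Unicode
property support at compile time is one thing worth checking.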

> May I forward some promo codes for you to play with our Android / iPhone
> business card reader? It may after all be the mobile app with the longest
> regular expressions (some of which I can't understand even though I wrote
> them :-))?

I suspect they won't be of use to me because I have neither an Android
nor an iPhone, nor indeed any kind of smartphone. I have a rather old -
by today's fast standards - Samsung clamshell on a pay-as-you-go
contract that hardly ever gets topped up. In other words, I am not a
heavy user of cell phones. (In case you don't know, though I've worked
with computers for over 40 years, I'm now retired. So part of
yesterday's generation...)

> PS: I wanted to also make a couple of requests for the future ...
> 1. It would be amazingly useful if we could insert "replace" directives
> within regexes, e.g. match "(Great Britain|United Kingdom|England)" and if
> so, replace with "UK". The documented way to do that is to use capturing
> groups and then replace as necessary (and we do that extensively) but it is
> often very much more complicated than doing it "in place" within the pattern

People ask that from time to time. My answer always is that PCRE is a
matching library, not a string manipulation library and that I have
enough on my hands maintaining it, especially now that it's very much a
part-time hobby. Somebody keen should write a string-manipulation
library (which could, of course, use PCRE). Or possibly implement it as
a PCRE add-on.
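
A replacement layer of this kind can be quite small when it sits on top
of an existing matching engine. The following is a minimal sketch, using
Python's re module in place of PCRE; the REPLACEMENTS table and the
normalize() helper are hypothetical names invented for the example, and
the mapping simply reuses the country names from the question.

```python
import re

# Hypothetical table: each alternative and the text that should replace it.
REPLACEMENTS = {
    "Great Britain": "UK",
    "United Kingdom": "UK",
    "England": "UK",
}

# One alternation built from the table keys. Longer keys go first so that
# a shorter alternative can never shadow a longer one that shares a prefix.
pattern = re.compile(
    "|".join(re.escape(key) for key in sorted(REPLACEMENTS, key=len, reverse=True))
)

def normalize(text: str) -> str:
    """Replace every matched alternative with its mapped value."""
    return pattern.sub(lambda m: REPLACEMENTS[m.group(0)], text)

print(normalize("Offices in England and the United Kingdom"))
# Offices in UK and the UK
```

The matching engine only finds the spans; the lookup table decides what
replaces them, which keeps the replacement logic out of the pattern.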

> 2. It would be great to have state variables, e.g. if a specific capturing
> group (say detecting "www") was matched with characters before the current
> positions, be able to set and test a state variable (say GOT_URL). Apologies
> if this is an existing part of the standard which I missed ...

You might be able to do that using callouts.

Philip

--
Philip Hazel
