Re: [pcre-dev] Order Insensitive PCRE regexes for matching m…

Author: Philip Hazel
Date:  
To: Frank Chang
CC: pcre-dev
Subject: Re: [pcre-dev] Order Insensitive PCRE regexes for matching multiple UTF-8 codepoints that are far apart does not function precisely
On Mon, 25 Jun 2012, Frank Chang wrote:

> Good evening. We are using C/C++ PCRE 8.30 with PCRE_UTF8 | PCRE_UCP |
> PCRE_COLLATE. Here's an order-insensitive regex:
> '(?=.*\x{00F6})(?=.*\x{00E4})'. It uses ?=, or *positive lookahead*, to
> make sure both UTF-8 code points are matched in either order.
> PCRE_compile() returns OK and PCRE_execute() returns OK on the string "DAS
> tausendschöne Jungfräulein". In hex, it is 44 41 53 20 74 61 75 73 65 6E
> 64 73 63 68 C3 B6 6E 65 20 4A 75 6E 67 66 72 C3 A4 75 6C 65 69 6E.
> However, GetMatchStart() returns 0 and GetMatchEnd() returns 0
> instead of GetMatchStart() = 14 and GetMatchEnd() = 27, which we obtain when
> we use the PCRE '\x{00F6}.*\x{00E4}' regex.
> Please advise us if it is possible to do order-insensitive matching of
> multiple UTF-8 code points in a PCRE regex. Thank you.


I do not recognize PCRE_COLLATE; it is not part of the base PCRE
library. I have run your regex through the basic pcretest program, and
it matches. This confirms your finding with PCRE_compile()
and PCRE_execute().

I don't recognize GetMatchStart() and GetMatchEnd() either. However,
since your regex consists entirely of assertions, the actual matched
string is empty (as pcretest shows). You need to modify your regex to
actually match something if you want a match start and end to be given
to you. If what you want is the string between these two code points, in
either order, something simple like

\x{00f6}.*?\x{00e4} | \x{00e4}.*?\x{00f6}

(ignore white space) should do what you want.
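
To make the difference concrete, here is a minimal sketch. It is an assumption-laden stand-in: it uses Python's standard re module rather than PCRE, and it matches against the UTF-8 bytes of the subject so that the reported offsets are byte offsets, like the 14 and 27 quoted above. The variable names are illustrative only.

```python
import re

# The subject string from the original message, encoded as UTF-8 so that
# match offsets are byte offsets (the 14 and 27 quoted in the question).
subject = "DAS tausendschöne Jungfräulein".encode("utf-8")

# Assertion-only pattern: both lookaheads succeed at offset 0, but they
# consume nothing, so the matched string is empty.
lookahead_only = re.compile(rb"(?=.*\xc3\xb6)(?=.*\xc3\xa4)")

# Consuming pattern: the two code points in either order, with a lazy gap.
either_order = re.compile(rb"\xc3\xb6.*?\xc3\xa4|\xc3\xa4.*?\xc3\xb6")

empty_span = lookahead_only.search(subject).span()
real_span = either_order.search(subject).span()

print(empty_span)  # (0, 0)
print(real_span)   # (14, 27)
```

Both patterns report a successful match; only the second one has a non-empty span to report.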

I realize that this example may be a simplification of your real
application, and my simple suggestion does not scale very well. But the
main point stands: if you want to extract strings, your regex must do
some actual matching, not just assertions.
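
One possible way around the scaling problem, offered only as a sketch and not as something tested against PCRE itself, is to keep one lookahead per required code point as a guard and then add a consuming part that runs from the first occurrence of any of them to the last. Note that this returns the widest such span, not the shortest one that the lazy alternation gives. The helper name span_covering_all is hypothetical, and Python's re module is again used as a stand-in for PCRE.

```python
import re

def span_covering_all(text, chars):
    """Return the (start, end) byte span from the first to the last
    occurrence of any required character, or None unless every
    character in chars appears in text. Assumes at least two
    distinct characters in chars."""
    data = text.encode("utf-8")
    encoded = [re.escape(c.encode("utf-8")) for c in chars]
    # One lookahead guard per required character (order-insensitive).
    guards = b"".join(b"(?=.*" + e + b")" for e in encoded)
    # The consuming part: any required character, a greedy gap, then
    # any required character again.
    any_of = b"(?:" + b"|".join(encoded) + b")"
    pattern = re.compile(guards + any_of + b".*" + any_of, re.DOTALL)
    m = pattern.search(data)
    return m.span() if m else None

print(span_covering_all("DAS tausendschöne Jungfräulein", ["ö", "ä"]))  # (14, 27)
print(span_covering_all("Jungfräulein schön", ["ö", "ä"]))              # (6, 19)
print(span_covering_all("DAS", ["ö", "ä"]))                             # None
```

The pattern grows linearly with the number of code points, whereas listing every ordering as a separate alternative grows factorially.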

Philip

--
Philip Hazel