Author: Frank Chang
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Order Insensitive PCRE regexes for matching multiple UTF-8 codepoints that are far apart does not function precisely
Mr. Philip Hazel, I tested the order-insensitive PCRE regular expression
that you suggested. It works quite well. Thank you for the detailed
reply.
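
For readers of the archive, here is a minimal sketch of the behaviour
discussed in the quoted exchange below. It is an illustration only, not
the code used in our application: it uses Python's built-in re module in
place of PCRE (so the code points are written as \u escapes rather than
\x{...}), and the names lookahead_only, either_order and byte_span are
made up for this example. Python reports offsets in code points, so the
helper converts them to UTF-8 byte offsets for comparison with the
figures in the original question.

```python
import re

text = "DAS tausendschöne Jungfräulein"

# Assertions only: both code points must occur somewhere, in either order.
# Nothing is consumed, so a successful match is the empty string.
lookahead_only = re.compile("(?=.*\u00f6)(?=.*\u00e4)")

# Alternation that actually consumes the text between the two code points,
# whichever of them comes first.
either_order = re.compile("\u00f6.*?\u00e4|\u00e4.*?\u00f6")


def byte_span(s, match):
    """Convert a match's code-point span into UTF-8 byte offsets."""
    start, end = match.span()
    return len(s[:start].encode("utf-8")), len(s[:end].encode("utf-8"))


m1 = lookahead_only.search(text)
m2 = either_order.search(text)

print(byte_span(text, m1))  # (0, 0): empty match at the start of the subject
print(byte_span(text, m2))  # (14, 27): from the first byte of ö to just past ä
```

The lookahead-only pattern still answers the yes/no question of whether
both code points are present; the alternation is needed only when the
start and end of the matched text matter.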

On Tue, Jun 26, 2012 at 8:55 AM, Philip Hazel <ph10@???> wrote:

> On Mon, 25 Jun 2012, Frank Chang wrote:
>
> > Good evening. We are using C/C++ PCRE 8.30 with PCRE_UTF8 | PCRE_UCP |
> > PCRE_COLLATE. Here is an order-insensitive regex:
> > '(?=.*\x{00F6})(?=.*\x{00E4})'. It uses ?= (*positive lookahead*) to
> > make sure both UTF-8 code points are matched in either order.
> > PCRE_compile() returns OK and PCRE_execute() returns OK on the string
> > "DAS tausendschöne Jungfräulein". In hex, it is 44 41 53 20 74 61 75 73
> > 65 6E 64 73 63 68 C3 B6 6E 65 20 4A 75 6E 67 66 72 C3 A4 75 6C 65 69 6E.
> >      However, GetMatchStart() returns 0 and GetMatchEnd() returns 0
> > instead of GetMatchStart() = 14 and GetMatchEnd() = 27, which we obtain
> > when we use the PCRE regex '\x{00F6}.*\x{00E4}'.
> >      Please advise us whether it is possible to do order-insensitive
> > matching of multiple UTF-8 code points in a PCRE regex. Thank you.

>
> I do not recognize PCRE_COLLATE; it is not part of the base PCRE
> library. I have run your regex through the basic pcretest program, and
> it matches. This confirms your finding with PCRE_compile()
> and PCRE_execute().
>
> I don't recognize GetMatchStart() and GetMatchEnd() either. However,
> since your regex consists entirely of assertions, the actual matched
> string is empty (as pcretest shows). You need to modify your regex to
> actually match something if you want a match start and end to be given
> to you. If what you want is the string between these two code points, in
> either order, something simple like
>
> \x{00f6}.*?\x{00e4} | \x{00e4}.*?\x{00f6}
>
> (ignore white space) should do what you want.
>
> I realize that this example may be a simplification of your real
> application, and my simple suggestion does not scale very well. But the
> main point stands: if you want to extract strings, your regex must do
> some actual matching, not just assertions.
>
> Philip
>
> --
> Philip Hazel