Re: [pcre-dev] Is it possible to match a UTF-8 string contai…

Author: Frank Chang
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Is it possible to match a UTF-8 string containing 3 or more far apart PCRE UTF-8 codepoints?
Mr. Philip Hazel, Yesterday I created a PCRE regex.
Id:      59
RegEx:   (?=.+(\x{00F6})){1}(?=.+(\x{00E4})){1}(?=.+(\x{00E7})){1}
Replace: $1 $2 $3

that matches the string "DAS tausendschöne Jungfräulein ausendschçne" 4
times, at byte positions (14,16], (25,27] and (42,44). Is it good
programming practice to use PCRE regexes such as
'(?=.+(\x{00F6})){1}(?=.+(\x{00E4})){1}(?=.+(\x{00E7})){1}'? Thank you for
your help.
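
For readers following the thread, here is a minimal sketch of how a lookahead-based pattern of this shape behaves. It uses Python's `re` module on the UTF-8 bytes as a stand-in for PCRE, which is an assumption made for illustration and not the setup used in this message. The `{1}` quantifiers are dropped and each `\x{...}` code point is written as its two UTF-8 bytes. The names `SUBJECT`, `LOOKAHEAD`, `ORDERED` and `capture_spans` are hypothetical.

```python
import re

# Subject bytes copied from the hex dump in the quoted question below.
SUBJECT = bytes.fromhex(
    "44 41 53 20 74 61 75 73 65 6E 64 73 63 68 C3 B6 6E 65 20 4A 75 6E 67 66 72 "
    "C3 A4 75 6C 65 69 6E 74 61 75 73 65 6E 64 73 63 68 C3 A7 6E 65 20"
)

# Three lookaheads, one per character (U+00F6, U+00E4, U+00E7 in UTF-8).
LOOKAHEAD = re.compile(rb"(?=.+(\xc3\xb6))(?=.+(\xc3\xa4))(?=.+(\xc3\xa7))")

# The order-dependent pattern from the original question, for comparison.
ORDERED = re.compile(rb"\xc3\xb6.*\xc3\xa4.*\xc3\xa7")


def capture_spans(match):
    """Return the (start, end) byte span of every capture group."""
    return [match.span(i) for i in range(1, match.re.groups + 1)]


if __name__ == "__main__":
    m = LOOKAHEAD.search(SUBJECT)
    # Lookaheads consume nothing, so the overall match is empty;
    # the positions of interest are in the capture groups.
    print(m.span())          # (0, 0)
    print(capture_spans(m))  # [(14, 16), (25, 27), (42, 44)]

    # The lookahead form does not care about the order of the characters.
    reordered = b"x " + "ç ä ö".encode("utf-8")
    print(LOOKAHEAD.search(reordered) is not None)  # True
    print(ORDERED.search(reordered) is not None)    # False
```

In this sketch the pattern yields a single zero-length match at offset 0, and the three byte ranges come from its capture groups rather than from separate matches.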

On Thu, Jun 28, 2012 at 1:13 AM, Philip Hazel <ph10@???> wrote:

> On Wed, 27 Jun 2012, Frank Chang wrote:
>
> > Good morning, We are trying to match the German string
> > 44 41 53 20 74 61 75 73 65 6E 64 73 63 68 C3 B6 6E 65 20 4A 75 6E 67
> > 66 72 C3 A4 75 6C 65 69 6E 74 61 75 73 65 6E 64 73 63 68 C3 A7 6E 65
> > 20 with the PCRE regex \x{00F6}.*\x{00E4}.*\x{00E7}. C/C++ PCRE
> > compiles this regex successfully, but pcre_exec returns 1 match, at
> > byte positions 14 and 45, instead of 4 matches. Our hypothesis is
> > that we have the wrong PCRE regex. If so, what might be a PCRE regex
> > that functions properly? Thank you.
>
> pcre_exec only ever returns 1 match. If you want to look for further
> matches, you have to call it again on the remainder of the string. There
> is sample code for doing this in the pcredemo.c program, which should
> also be displayed if you run "man pcredemo". Look down the end for where
> it handles the -g option.
>
> For a good discussion of the general use of regex, see Jeffrey Friedl's
> book "Mastering Regular Expressions", 3rd Edition, O'Reilly.
>
> Philip
>
> --
> Philip Hazel
>
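
The advice quoted above, to call the matcher again on the remainder of the string, can be sketched as a loop that restarts the search at the end of the previous match. The sketch below uses Python's `re` module on the UTF-8 bytes purely as an illustration; it is not the pcredemo.c code, and with PCRE the same idea would go through the start offset passed to pcre_exec. The pattern and the names `SUBJECT`, `PATTERN` and `find_all_spans` are hypothetical.

```python
import re

# Subject bytes copied from the hex dump in the quoted question.
SUBJECT = bytes.fromhex(
    "44 41 53 20 74 61 75 73 65 6E 64 73 63 68 C3 B6 6E 65 20 4A 75 6E 67 66 72 "
    "C3 A4 75 6C 65 69 6E 74 61 75 73 65 6E 64 73 63 68 C3 A7 6E 65 20"
)

# Hypothetical pattern: any one of U+00F6, U+00E4, U+00E7 as UTF-8 bytes.
PATTERN = re.compile(rb"\xc3[\xb6\xa4\xa7]")


def find_all_spans(pattern, subject):
    """Collect every match by searching again from the end of the last one."""
    spans = []
    pos = 0
    while pos <= len(subject):
        m = pattern.search(subject, pos)
        if m is None:
            break  # no further matches in the remainder
        spans.append(m.span())
        # Continue after this match; step one byte further after an
        # empty match so the loop cannot stall at the same offset.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return spans


if __name__ == "__main__":
    print(find_all_spans(PATTERN, SUBJECT))
    # [(14, 16), (25, 27), (42, 44)]
```

Each call to `search` returns at most one match, so it is the loop, not the pattern, that produces the list of positions.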