Author: Philip Hazel
Date:  
To: Frank Chang
CC: pcre-dev
Subject: Re: [pcre-dev] Is it possible to match a UTF-8 string containing 3 or more far apart PCRE UTF-8 codepoints?
On Wed, 27 Jun 2012, Frank Chang wrote:

> Good morning. We are trying to match the German string 44 41 53 20 74 61
> 75 73 65 6E 64 73 63 68 C3 B6 6E 65 20 4A 75 6E 67 66 72 C3 A4 75 6C 65 69
> 6E 74 61 75 73 65 6E 64 73 63 68 C3 A7 6E 65 20 with the PCRE regex
> \x{00F6}.*\x{00E4}.*\x{00E7}. C/C++ PCRE compiles this regex successfully,
> but pcre_exec returns 1 match, at byte positions 14 and 45, instead of 4
> matches. Our hypothesis is that we have the wrong PCRE regex. If so, what
> might be a PCRE regex that functions properly? Thank you.


pcre_exec only ever returns 1 match. If you want to look for further
matches, you have to call it again on the remainder of the string. There
is sample code for doing this in the pcredemo.c program, which should
also be displayed if you run "man pcredemo". Look towards the end, where
it handles the -g option.

For a good discussion of the general use of regex, see Jeffrey Friedl's
book "Mastering Regular Expressions", 3rd Edition, O'Reilly.

Philip

--
Philip Hazel