Re: [pcre-dev] PCRE bug

Author: Philip Hazel
Date:  
To: Srinivas R Thota
CC: pcre-dev
Subject: Re: [pcre-dev] PCRE bug
On Wed, 9 Apr 2008, Srinivas R Thota wrote:

> When I tried to match
>
> "000*1211" against the regular expression
>
> ((0){1,})|((0)|(1)|(2)){1,}(\\*)((0)|(1)|(2)){1,}
>
> with pcre, the first member of the match structure (regmatch_t)
> should contain the whole string that is matched, which should be
> "000*1211", but PCRE 7.0 actually matches only "000" (just 3
> characters)?


PCRE uses Perl semantics for matching, even if you use the POSIX
interface, as I assume you have done (judging by your reference to
regmatch_t). This is clearly documented: the pcreposix man page says
this:

"When PCRE is called via these functions, it is only the API that is
POSIX-like in style. The syntax and semantics of the regular
expressions themselves are still those of Perl..."

Perl finds the *first* match, which is not necessarily the
*longest* match. Your pattern starts like this (putting in some spaces
for readability):

( (0){1,} ) |

That alternative is tested first; it matches the 000, and so that is
what is reported as the match. PCRE looks no further (neither does
Perl).
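
The first-match behaviour can be reproduced outside PCRE. What follows is
only a minimal sketch, assuming Python's re module as a stand-in: it is
not PCRE, but it also tries alternatives left to right and stops at the
first one that succeeds. The variable names are illustrative, and the C
string escape \\* from the report is written as \* in a raw string.

```python
import re

subject = "000*1211"
# Pattern from the report; \\* in the C string literal becomes \* here.
pattern = r"((0){1,})|((0)|(1)|(2)){1,}(\*)((0)|(1)|(2)){1,}"

m = re.search(pattern, subject)
matched = m.group(0)       # overall match, comparable to the first regmatch_t entry
first_branch = m.group(1)  # set only when the first alternative matched
star_group = m.group(7)    # the (\*) group of the second alternative

# The first alternative succeeds at offset 0, so the second is never tried.
print(matched, first_branch, star_group)
```

The overall match is just the leading zeros, and the group belonging to
the second alternative is left unset because that alternative was never
attempted.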

You probably want to write the top-level alternatives in the other
order so that the more complicated one is tested first.
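
As a rough illustration of that reordering, again assuming Python's re
module as a stand-in for Perl-style matching rather than PCRE itself, the
same two alternatives can simply be swapped. The name "reordered" is
hypothetical, and note that swapping the alternatives renumbers the
capturing groups.

```python
import re

# The two top-level alternatives from the report, in the opposite order:
# the longer "digits * digits" form first, the plain run of zeros second.
reordered = r"((0)|(1)|(2)){1,}(\*)((0)|(1)|(2)){1,}|((0){1,})"

# The longer alternative is tried first and matches the whole subject.
whole = re.search(reordered, "000*1211").group(0)

# With no '*' in the subject the first alternative fails, and the
# matcher falls back to the run of zeros.
zeros_only = re.search(reordered, "000").group(0)

print(whole)
print(zeros_only)
```

With this order the full string is reported for "000*1211", while a
subject consisting only of zeros still matches through the second
alternative.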

Philip

--
Philip Hazel