Re: [pcre-dev] is this a BUG in PCRE 7.0 ?

Author: Philip Hazel
Date:  
To: Rain Chen
CC: pcre-dev
Subject: Re: [pcre-dev] is this a BUG in PCRE 7.0 ?
On Mon, 11 Jun 2007, Rain Chen wrote:

> PHP was upgraded to 5.2.3, which uses PCRE Library Version => 7.0 18-Dec-2006
>
> This version of PCRE doesn't seem to work well with PHP.
>
> I met the same problem with php5.2.1+PCRE 7.0 on FreeBSD 6.2, and resolved
> it by downgrading PCRE to 6.7


I am not a PHP user, and know very little about PHP. In order to fix any
possible bug in PCRE, I need to be able to demonstrate the bug without
using PHP.

> Reproduce code:
> ---------------
> <?php
> $str = "repeater id='loopt' dataSrc=subject colums=2";
> preg_match_all("/(['\"])((.*(\\\\\\1)*)*)\\1/sU",$str,$str_instead);
>
> echo "<xmp>";
> print_r($str_instead);
> ?>


I'm not familiar with PHP, but I *think* the equivalent test using the
pcretest program would be to use this pattern:

/(['"])((.*(\\\1)*)*)\1/sUg

However, the s and the g flags make no difference for this subject (it
contains no newlines, and there is only one possible match), so I will
ignore them from now on. That pattern means:

  match ' or " (this is group 1)
  match any number of characters, ungreedily, followed by zero or more
    occurrences of a literal backslash followed by what group 1 matched
    (' or ")
  repeat all of that zero or more times
  then match what group 1 matched (' or ")
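
As a rough cross-check of that reading, the same pattern can be written out
with comments. This is only a sketch using Python's re module, which is not
PCRE: it has no /U flag, so each repeat is spelled *? to make it ungreedy,
and I am assuming its behaviour is comparable for this simple subject.

```python
import re

# Python's re has no /U flag, so every ungreedy repeat is written as *? here.
PATTERN = re.compile(r"""
    (['"])          # group 1: an opening ' or "
    (               # group 2: everything between the quotes
      (             # group 3: one pass round the outer loop
        .*?         #   any characters, as few as possible
        (\\\1)*?    #   group 4: backslash + what group 1 matched, repeated
      )*?           # repeat group 3 zero or more times
    )
    \1              # the same quote character that group 1 matched
""", re.VERBOSE | re.DOTALL)

subject = "repeater id='loopt' dataSrc=subject colums=2"
m = PATTERN.search(subject)
print(m.group(0), m.group(1), m.group(2))   # 'loopt' ' loopt
```

Only groups 0 to 2 are printed, because they are fixed by the subject itself;
what the inner groups hold depends on how the engine walks the nested repeats.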


The pattern is a scary one, because it contains nested unlimited
repeats, which are always dangerous, especially if something can match
no characters.

Your subject line does not contain any backslashes, so the optional
search for backslash followed by \1 will never match. So it's the same as
using this pattern:

/(['"])((.*)*)\1/U

I tried that, and also your original, with pcretest on 7.2-testing (not
yet released) and the output was

PCRE version 7.2-RC3 2007-06-05

/(['"])((.*(\\\1)*)*)\1/U
    repeater id='loopt' dataSrc=subject colums=2
 0: 'loopt'
 1: '
 2: loopt
 3: t


/(['"])((.*)*)\1/U
    repeater id='loopt' dataSrc=subject colums=2
 0: 'loopt'
 1: '
 2: loopt
 3: t


This is exactly what I would expect. After matching the ' it will match
the following characters one by one (because of the ungreedy flag). So
group 3 will match 5 times. For the first 4, the subsequent match of \1
fails. After the fifth time, it succeeds and the whole match is done.
The fifth time round the loop, group 3 captures 't'. The match is the
same as this pattern

/(['"])((.)(.)(.)(.)(.))\1/

except that all the (.) parentheses are numbered 3.
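
The point about group numbering can be seen in a small sketch. It uses
Python's re module rather than PCRE, and a deliberately simpler looped
pattern, (.)*? instead of (.*)*, so it illustrates the "last pass wins"
rule rather than reproducing the pcretest run above.

```python
import re

subject = "repeater id='loopt' dataSrc=subject colums=2"

# A capturing group inside a repeat is overwritten on every pass, so after
# the match it holds only what the final pass captured.
looped = re.search(r"""(['"])((.)*?)\1""", subject)
print(looped.group(2), looped.group(3))   # loopt t

# Unrolled: the same text is matched, but each (.) has its own number.
unrolled = re.search(r"""(['"])((.)(.)(.)(.)(.))\1""", subject)
print(unrolled.groups())   # ("'", 'loopt', 'l', 'o', 'o', 'p', 't')
```

In the looped form there is only one group 3, so the earlier characters are
overwritten and 't' is all that remains.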

I tried this same test on Perl 5.8 and got exactly the same result. I do
not think, therefore, that there is a problem with PCRE.

You said that there was a change from PCRE 6.7. There was this change
made for the 7.0 release:

38. Like Perl, PCRE detects when an indefinitely repeated parenthesized group
    matches an empty string, and forcibly breaks the loop. There were bugs in
    this code in non-simple cases. For a pattern such as  ^(a()*)*  matched
    against  aaaa  the result was just "a" rather than "aaaa", for example. Two
    separate and independent bugs (that affected different cases) have been
    fixed.


It may be that you were relying on a bug. :-(


> Expected result:
> ----------------
> <xmp>Array
> (
>     [0] => Array
>         (
>             [0] => 'loopt'
>         )
>
>     [1] => Array
>         (
>             [0] => '
>         )
>
>     [2] => Array
>         (
>             [0] => loopt
>         )
>
>     [3] => Array
>         (
>             [0] => loopt
>         )
>
>     [4] => Array
>         (
>             [0] =>
>         )


Why do you expect [3] to be loopt? It is true that that is *one*
possible way that the pattern could match, but there is no guarantee
that it does match that way.


> Actual result:
> --------------
> <xmp>Array
> (
>     [0] => Array
>         (
>         )


I presume that means it's failing to match. I can't comment on that.

> I reported this to the PHP dev team, but they thought it is not PHP's
> problem but PCRE's. This is why you got this mail.
>
> Detailed report at: http://bugs.php.net/bug.php?id=41638


I have not looked at that. I really don't want to have to learn about
PHP in order to maintain PCRE - there just isn't time to learn all the
applications that use PCRE. If somebody can send me a bug in the form of
"this pattern does not do the right thing with this string", then I am
very happy to try to see what is going on.

Philip

--
Philip Hazel, University of Cambridge Computing Service.