Re: [pcre-dev] Problem with a regexp

Page principale
Supprimer ce message
Auteur: Philip Hazel
Date:  
À: Thomas Weber
CC: pcre-dev
Sujet: Re: [pcre-dev] Problem with a regexp
On Wed, 17 Sep 2008, Thomas Weber wrote:

> I'm not sure if the following is expected behaviour or a bug (the
> problem came from a user who seems to experiment with regular
> expressions):
>
> =======================================================================
> $ pcretest
> PCRE version 7.6 2008-01-28
>
> re> '(\s*-*\d+[.]*\d*\s*)+\n'
> data> '\t4\n0000\t-0.00\t-0.0000\t4\t-0.00\t-0.0000\t4\n0000\t-0.00\t-0.0000\t0\t-0.00\t-'
> Error -8
> =======================================================================
>
> The same happens with 7.8. Is this error expected?


This same question was just discussed on the exim-dev mailing list. I
answered as follows:

======================================================================
My instant reaction to that pattern is "nested unlimited repeats -
Beware, Beware!" That kind of pattern is always in danger of running
into PCRE_ERROR_MATCHLIMIT because the size of the search tree explodes
exponentially.

> data>
>

'\t4\n0000\t-0.00\t-0.0000\t4\t-0.00\t-0.0000\t4\n0000\t-0.00\t-0.0000\t0\t-
> 0.00\t-'
> Error -8


I have not analyzed your data, but I'm not at all surprised. Note that,
for pcretest, you should not put the data in delimiters. In your
example, the quotes will be taken as part of the data.
======================================================================

The OP then said that that it worked with matlab, so I tried it with
Perl, and Perl was OK, so I then made this comment:

However, I note that Perl also manages to handle this particular
matching quite quickly. I expect there is some specific optimization
that it does. In fact, I can guess what it is. If you feed PCRE this
pattern:

'(\s*-*\d++[.]*\d*\s*)+\n'

(note the one additional '+' character), then it too finds the same
match as Perl, very quickly. My guess that is that Perl manages to
"auto-possessify" the \d+. PCRE does have code to do that in some cases
when what follows something like \d+ cannot possibly match \d, but it
isn't clever enough to handle this case (it looks only at the
immediately following item).

The OP's comment about \Q is not right; however, I can get the regex to
work by adding \q13000000 on the front of the data, though it is very
slow. This sets MATCH_LIMIT (the default is 10000000). I discovered what
number to use by adding \M on the front. This (as well as the fact that
possessifying the \d+ makes it work) confirms the original diagnosis.

Philip

--
Philip Hazel