[pcre-dev] [Bug 1348] very slow execution of some non-matchi…

Pàgina inicial
Delete this message
Autor: Philip Hazel
Data:  
A: pcre-dev
Assumpte: [pcre-dev] [Bug 1348] very slow execution of some non-matching patterns
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1348




--- Comment #1 from Philip Hazel <ph10@???> 2013-04-17 16:30:39 ---
On Wed, 17 Apr 2013, Miklos Tirpak wrote:

> Consider the following 5 patterns, they differ only at the end:
>
> 1. "^([0-9a-fA-F#\*]+[\-\.\(\)]?)*[0-9a-fA-F#\*]+asd1$"
> 2. "^([0-9a-fA-F#\*]+[\-\.\(\)]?)*[0-9a-fA-F#\*]+asd2$"
> 3. "^([0-9a-fA-F#\*]+[\-\.\(\)]?)*[0-9a-fA-F#\*]+asd3$"
> 4. "^([0-9a-fA-F#\*]+[\-\.\(\)]?)*[0-9a-fA-F#\*]+asd4$"
> 5. "^([0-9a-fA-F#\*]+[\-\.\(\)]?)*[0-9a-fA-F#\*]+asd5$"
>
> The input string is the same in each case, "9999999992250999", none of the
> patterns match.
>
> When I try to match the following patterns on the input string and measure the
> execution time, the 2nd and the 5th execution takes very long.


This is because PCRE has an optimization. It knows that anything
matching the 1st pattern must contain "1", because that is the last
literal character in the pattern, and similarly for all the others. Your
input string does not contain "1", "3", or "4". Therefore, the 1st, 3rd,
and 4th patterns fail very quickly, without ever trying to run the
matching engine.

In the other two cases, the matching engine has to run. Your pattern
looks like a typical pattern that has a very large number of
possibilities in its matching tree. In particular, it includes nested
unlimited repeats, which are always the sign of a pattern that behaves
like this. Often, running such a pattern on data that does not match
results in a segfault.

This is not a bug, though it is frequently mis-reported as one, so much
so that I keep this standard response on file:

------------------------------------------------------------------------
1. Matching a regular expression is like finding your way through a
forest with many branching paths. As PCRE passes each junction, it has
to remember data so that it can backtrack to that point if necessary. By
default, it uses recursion to store this data on the process stack,
because that is fast. However, it can alternatively be compiled to use
the heap instead (run ./configure with --disable-stack-for-recursion),
but that slows performance.

2. It is very easy to write a regular expression that has a very large
number of branches (unlimited repetition of a group, for example). When
PCRE goes deep into such a tree, it may use a lot of memory.

3. Even in these days of gigabyte main memories, some operating system
environments set small default limits on the maximum size of the process
stack, for example, 8Mb. Thus, it is often the case that there is more
heap than stack available (by default). A matching operation that needs
a lot of memory may succeed if the heap is used, but run out of memory
if the stack is used.

4. Running out of stack often causes a segfault. Because of this, PCRE
contains the facility to limit the depth of recursion so as to return an
error code instead. However, the default value is large, so it does not
normally come into play unless you explicitly set a smaller value.

5. If you are running into a problem of stack overflow, you have the
following choices:

  (a) Work on your regular expression pattern so that it uses less 
      memory. Sometimes using atomic groups can help with this.
  (b) Increase the size of your process stack.
  (c) Compile PCRE to use the heap instead of the stack.
  (d) Set PCRE's recursion limit small enough so that it gives an error
      before the stack overflows.    


6. There is more discussion of some of these ideas in the pcrestack.3
man page.
------------------------------------------------------------------------

In your case, the stack must be sufficiently large for the entire tree
search to take place - but this does take a long time. If you change
some of your repetitions to be atomic, the run time will be drastically
shorter.


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email