Re: [pcre-dev] PCRE regex engine crash

Góra strony
Delete this message
Autor: Sowmy Srinivasan
Data:  
Dla: pcre-dev@exim.org
Temat: Re: [pcre-dev] PCRE regex engine crash
Thanks for your prompt response. I will investigate the pattern and get back to you.
One additional data is that the input pattern is an email in one of the Asian languages. Would that make a difference.
Thanks
Sowmy

-----Original Message-----
From: Philip Hazel [mailto:ph10@hermes.cam.ac.uk]
Sent: Thursday, November 11, 2010 9:01 AM
To: Sowmy Srinivasan
Cc: 'pcre-dev@???'
Subject: Re: [pcre-dev] PCRE regex engine crash

On Thu, 11 Nov 2010, Sowmy Srinivasan wrote:

> We are evaluating Windows version 8.10 of PCRE to replace our older version. When testing we hit stack overflow exception for the following regex.
> (?:\<\/?(?:font|strong|span)[^\>]*\> ?){4,}\<IMG
> SRC\=(?:3d)?\"?cid\:[^\>]+\>
>
> Is this something that is known and fixed in subsequent releases?


This is not a bug, though it is frequently mis-reported as one, so much so that I keep this standard response on file:

------------------------------------------------------------------------
1. Matching a regular expression is like finding your way through a forest with many branching paths. As PCRE passes each junction, it has to remember data so that it can backtrack to that point if necessary. By default, it uses recursion to store this data on the process stack, because that is fast. However, it can alternatively be compiled to use the heap instead (run ./configure with --disable-stack-for-recursion), but that slows performance.

2. It is very easy to write a regular expression that has a very large number of branches (unlimited repetition of a group, for example). When PCRE goes deep into such a tree, it may use a lot of memory.

3. Even in these days of gigabyte main memories, some operating system environments set small default limits on the maximum size of the process stack, for example, 8Mb. Thus, it is often the case that there is more heap than stack available (by default). A matching operation that needs a lot of memory may succeed if the heap is used, but run out of memory if the stack is used.

4. Running out of stack often causes a segfault. Because of this, PCRE contains the facility to limit the depth of recursion so as to return an error code instead. However, the default value is large, so it does not normally come into play unless you explicitly set a smaller value.

5. If you are running into a problem of stack overflow, you have the following choices:

  (a) Work on your regular expression pattern so that it uses less 
      memory. Sometimes using atomic groups can help with this.
  (b) Increase the size of your process stack.
  (c) Compile PCRE to use the heap instead of the stack.
  (d) Set PCRE's recursion limit small enough so that it gives an error
      before the stack overflows.    


6. There is more discussion of some of these ideas in the pcrestack.3 man page.
------------------------------------------------------------------------

I have not evaluated your regex in detail, but I see that it does contain nested unlimited repetition.

There is, of course, the possibility that there is a real bug that is causing your stack overflow, though I suspect the probability is small.
Since you did not include the data on which you are trying your pattern, I can't try to reproduce the problem. One sure test is to increase the size of the stack and see if that fixes it.

8.10 is the current release; I will shortly (probably within a week) be putting out a test candidate for 8.11 but there is no change in this regard - there can't be: one just needs a lot of storage if there is a lot of backtracking to remember.

Philip

--
Philip Hazel