[pcre-dev] [Bug 1347] Deep recursion causing SegFault

Author: 1347
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 1347] Deep recursion causing SegFault

------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1347

--- Comment #3 from BenBE@??? 2013-04-16 18:42:55 ---
Hi Phil,

and thanks for this (even if canne) response - I feared as much ;-)
>> For a website using Syntax Highlighting based on GeSHi 1.0.X there's a
>> reproduceable crash for certain inputs that causes a Segfault while the
>> highlighting is performed.
> I haven't time to look at this in detail just at the moment, but as you
> mention the word "recursion", it may be just normal PCRE behaviour ...
> using a lot of stack.
Actually I already had several occasions (or rather did the users of
GeSHi) have spurious crashes based on this behaviour.

So let's have a look at the various points.
> This is not a bug, though it is frequently mis-reported as one, so much
> so that I keep this standard response on file:
Unfortunately that's the default behaviour for most distributions
causing quite some hard to debug crashes as (see stacktrace of my
original report) there's not much relation between e.g. the executed PHP
code and the actual crash down in libpcre - thus I'm getting quite
regularly reports on crashes where there's not much debugging that can
be done. In this case I got quite lucky as there's a clear sample
causing the trouble where even the code to reproduce is quite minimal.
> ------------------------------------------------------------------------
> 1. Matching a regular expression is like finding your way through a
> forest with many branching paths. As PCRE passes each junction, it has
> to remember data so that it can backtrack to that point if necessary. By
> default, it uses recursion to store this data on the process stack,
> because that is fast. However, it can alternatively be compiled to use
> the heap instead (run ./configure with --disable-stack-for-recursion),
> but that slows performance.
Unfortunately not an option when writing a library that's dependent on
the system-installed versions and configurations of things.

But even than I think here's the best way to workaround the issue. How
much work would it be to allow for libpcre to be compiled ina kinda
"hybrid mode" where stack-based matching is tried first and if reaching
the (stack) recursion limit a retry using the heap-based implementation
is performed with much higher limits?

Given (as you say so yourself below) that usually you can rewrite
regular expressions to avoid recursion as much as possible this could be
a fallback that solved two issues:
- - libpcre causing random crashes due to complex queries
- - if your task is complex enough to exceed "softlimits" usually
performance won't matter anymore

Having a slow solution is better than having none at all.
> 2. It is very easy to write a regular expression that has a very large
> number of branches (unlimited repetition of a group, for example). When
> PCRE goes deep into such a tree, it may use a lot of memory.
I know. And I guess I'm hitting something like this in the current
situation.

Is there a way to get some "stats" for a regexp like max recursion
depth, taken branches, steps needed for solution, ...?
> 3. Even in these days of gigabyte main memories, some operating system
> environments set small default limits on the maximum size of the process
> stack, for example, 8Mb. Thus, it is often the case that there is more
> heap than stack available (by default). A matching operation that needs
> a lot of memory may succeed if the heap is used, but run out of memory
> if the stack is used.
See suggestion above.
> 4. Running out of stack often causes a segfault. Because of this, PCRE
> contains the facility to limit the depth of recursion so as to return an
> error code instead. However, the default value is large, so it does not
> normally come into play unless you explicitly set a smaller value.
Unfortunately PHP has a horrible interface in regards to this: I don't
really see that error code and usually I just get some empty return
string instead, so I can't even check for the recursion stack overflow
condition. And even than, due to the way GeSHi works there are quite
some regular expressions to be evaluated and thus checking for each and
every return value would be quite a hit on performance.
> 5. If you are running into a problem of stack overflow, you have the
> following choices:
>
> (a) Work on your regular expression pattern so that it uses less > memory. Sometimes using atomic groups can help with this.

Will need some debugging aid for this.

> (b) Increase the size of your process stack.
Can't influence this for PHP. Also GeSHi is a library --> no chance.
> (c) Compile PCRE to use the heap instead of the stack.
Can't influence this. Would have to be changed in the distribution. ->
No chance.
> (d) Set PCRE's recursion limit small enough so that it gives an error > before the stack overflows.

Well, this should avoid the crash, sure. But won't help with the actual
issue of the expression complexity.
> 6. There is more discussion of some of these ideas in the pcrestack.3
> man page.
Some questions on that:
- - Do you know if PHP uses the pcrejit option? And if yes, since around
which version?
- - Given something like ^\d+(?:(?:A|B|BC|BD|C|DB)+\d+)*$ What would be
the best way to ask libpcre to try various combinations of the letters
while avoiding backtracking once one such consecutive group has been found?
> Philip
I think the issue itself might reside on both sides, thus some
assistance with debugging and creating a suiteable solution would be
nice. There aren't any real time constraints, thus if finding time to
look at things in more detail takes a while it shouldn't matter.

Best regards,
BenBE.

--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email

This message is part of the following thread:
	the complete thread tree sorted by date
	Philip Hazel at
	Philip Hazel at