[pcre-dev] [Bug 1347] Deep recursion causing SegFault

Author: Philip Hazel
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 1347] Deep recursion causing SegFault

------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1347

--- Comment #4 from Philip Hazel <ph10@???> 2013-04-17 16:56:03 ---
On Tue, 16 Apr 2013, BenBE@??? wrote:

> > This is not a bug, though it is frequently mis-reported as one, so much
> > so that I keep this standard response on file:
> Unfortunately that's the default behaviour for most distributions
> causing quite some hard to debug crashes as (see stacktrace of my
> original report) there's not much relation between e.g. the executed PHP
> code and the actual crash down in libpcre - thus I'm getting quite
> regularly reports on crashes where there's not much debugging that can
> be done. In this case I got quite lucky as there's a clear sample
> causing the trouble where even the code to reproduce is quite minimal.

As you probably realize, I work only on PCRE. I don't use PHP and know
nothing about it. All those PHP functions in your backtrack don't mean
anthing to me, I'm afraid. Nor do I know anything about GeSHi or Lua. In
order to check for bugs, I have to have the pattern and subject string
that is being matched. Sorry if that sounds harsh!

> But even than I think here's the best way to workaround the issue. How
> much work would it be to allow for libpcre to be compiled ina kinda
> "hybrid mode" where stack-based matching is tried first and if reaching
> the (stack) recursion limit a retry using the heap-based implementation
> is performed with much higher limits?

A client program could do that for itself. It could compile two separate
PCRE libraries (for each data size - 8, 16, 32 - that it needs) and when
a call to the stack-based one returned with a recursion limit exceeded,
it could call the other library. [To avoid name clashes, there would
have to be some fiddling, probably involving macros, when compiling the
two libraries. Sorting our this detail would involve some work, I
guess.]

> Is there a way to get some "stats" for a regexp like max recursion
> depth, taken branches, steps needed for solution, ...?

The pcretest program has a facility for determining the max recursion
depth, given a pattern and a subject string. Grep for \M in the pcretest
man page.

> > (a) Work on your regular expression pattern so that it uses less > > memory. Sometimes using atomic groups can help with this. > Will need some debugging aid for this.

pcretest is your friend.

> Some questions on that:
> - - Do you know if PHP uses the pcrejit option? And if yes, since around
> which version?

Sorry, I know nothing about PHP.

> - - Given something like ^\d+(?:(?:A|B|BC|BD|C|DB)+\d+)*$ What would be
> the best way to ask libpcre to try various combinations of the letters
> while avoiding backtracking once one such consecutive group has been found?

Atomic groups and possessive quantifiers. For example, you probably
should write \d++ instead of \d+ above. Given a run of digits, there is
no point trying with all possible numbers of them, since what follows is
never a digit. [PCRE can sometimes convert \d+ into \d++ for you
automatically, but not in that example. It only does it when the very
next item is obviously not a digit.]

Again, if you write (?:A|B|BC|BD|C|DB)++ once it has found A it will
carry on to look for a digit, but if there isn't one, it won't look for
other letters. But that means you should re-order the alternatives, as
otherwise it will never find BC (for example) because it will find B
first.

Regards,
Philip

--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email

This message is part of the following thread:
	the complete thread tree sorted by date
	1347 at
	Zoltan Herczeg at