[pcre-dev] [Bug 1467] Pcre hang up on specific regular expre…

Author: Philip Hazel
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1467] Pcre hang up on specific regular expression and JIT enabled
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1467




--- Comment #5 from Philip Hazel <ph10@???> 2014-04-21 18:09:36 ---
I don't see the 995,996 differences when I try this on my box. In all cases, if
I disable JIT, I get a fast "match limit exceeded" return.

Incidentally: Using alternation for single characters is using a sledgehammer
to crack a nut. (A|B|C) is much better written as [ABC]. The reason is that the
overheads for setting up a nested mini-regex (which is what parentheses do) are
much larger than a character class, where the engine knows it is matching just
one character. OK, there's a bit of complication in this case because of the
use of '.', which does not match newline, but (.|\n) is not the best way of
matching "any character". There is worse news when repeats are involved.
Repeated groups are duplicated in the compiled code. For example, (A){3,4} is
compiled as if it were (A)(A)(A)(A)? so that each iteration can have its own
backtracking points.
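
As a rough illustration of the two points above, here is a small sketch using
Python's re module. Python's engine is not PCRE, so this only checks that the
rewritten patterns accept the same strings; it says nothing about PCRE's
compiled code. The sample strings and the unrolled form (A)(A)(A)(A)? are my
own assumptions, chosen for the demonstration.

```python
import re

# NOTE: Python's re is a different engine from PCRE; this only compares
# which strings the patterns accept, not how they are compiled.

# Single-character alternation versus the equivalent character class.
alternation = re.compile(r"(?:A|B|C)+")
char_class = re.compile(r"[ABC]+")

samples = ["ABCABC", "AXB", "", "CCCD"]  # hypothetical test strings
same_matches = all(
    (alternation.fullmatch(s) is None) == (char_class.fullmatch(s) is None)
    for s in samples
)


def span_or_none(regex, text):
    """Return the (start, end) of a match anchored at the start, or None."""
    m = regex.match(text)
    return m.span() if m else None


# A counted repeat versus a hand-unrolled form with one copy per iteration.
counted = re.compile(r"(A){3,4}")
unrolled = re.compile(r"(A)(A)(A)(A)?")

repeat_samples = ["AA", "AAA", "AAAA", "AAAAA"]
counted_spans = [span_or_none(counted, s) for s in repeat_samples]
unrolled_spans = [span_or_none(unrolled, s) for s in repeat_samples]

print("class and alternation agree:", same_matches)
print("counted: ", counted_spans)
print("unrolled:", unrolled_spans)
```

The two forms of the repeat match the same text, but the capture groups are
numbered differently in the unrolled pattern, so it is only a model of the
duplication, not a drop-in replacement.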

You can't use a class for (.|\n) but you can temporarily ensure that '.'
matches any character by switching into "single line" mode. On my box, the
matching times for /(.|\n){999}/ and /(?s:.{999})/ on a long string are 0.0224
and 0.0001 respectively. This is not surprising, because in the second case
PCRE knows it can simply skip along 999 characters. Another alternative, if you
know your data does not contain binary zeroes, is to use something like [^\x0]
to match any character except a binary zero. This is slower (0.0005) because it
has to check each character.
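
For anyone who wants to try the comparison themselves, here is a minimal
sketch, again using Python's re module rather than PCRE, so the absolute
numbers will not match the ones quoted above. The subject string and the
repeat count passed to timeit are arbitrary assumptions. Python spells the NUL
escape as \x00, and its (?s:...) scoped flag needs Python 3.6 or later.

```python
import re
import timeit

# NOTE: Python's re is not PCRE; timings here are only indicative.
# Hypothetical subject: 200 lines of 20 characters each, newlines included.
subject = "some text on a line\n" * 200

patterns = {
    "alternation": r"(.|\n){999}",    # group with alternation, repeated
    "dotall": r"(?s:.{999})",         # '.' matches newline inside the group
    "negated class": r"[^\x00]{999}",  # any character except binary zero
}

compiled = {name: re.compile(p) for name, p in patterns.items()}

# All three should consume exactly 999 characters from the start.
match_ends = {name: rx.match(subject).end() for name, rx in compiled.items()}

# Total time for 200 match attempts per pattern (arbitrary repeat count).
timings = {
    name: timeit.timeit(lambda rx=rx: rx.match(subject), number=200)
    for name, rx in compiled.items()
}

for name in patterns:
    print(f"{name:14s} end={match_ends[name]} time={timings[name]:.4f}s")
```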


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email