Re: [pcre-dev] Question regarding regex complexity, catastr…

Top Page
Delete this message
Author: ph10
Date:  
To: Kilian Kilger
CC: pcre-dev
Subject: Re: [pcre-dev] Question regarding regex complexity, catastrophic backtrack and jit/no_jit
On Fri, 21 Feb 2020, Kilian Kilger via Pcre-dev wrote:

> Matching with jit, it was very easy to produce an example which
> exceeds the available resources: We take the pattern
> "(*LIMIT_MATCH=10)(x+x+x+x+)+y" and as subject we take a string of
> length 100000 containing only the letter "x".
>
> It was however surprisingly difficult to get the normal non-jit
> matcher pcre2_match to run out of resources. With the "y" at the end
> of the pattern, we never got any resource exhaustion in the non-jit
> case.


That's because of this optimization:

$ ./pcre2test -i
PCRE2 version 10.35-RC1 2019-11-27
re> "(*LIMIT_MATCH=10)(x+x+x+x+)+y"

Capture group count = 1
Match limit = 10
First code unit = 'x'
Last code unit = 'y'              <=================================
Subject length lower bound = 5


The subject will be searched for 'y' before doing anything else.
However, no more that 5000000 bytes are searched.

re> "(*LIMIT_MATCH=10)(x+x+x+x+)+y"
data> \[x]{1000000}

No match
data> \[x]{6000000}

Failed: error -47: match limit exceeded

JIT must be doing something different. Perhaps Zoltán will answer.

> At the end we only succeeded by taking a very long pattern which
> consisted of "(*LIMIT_HEAP=1)(*LIMIT_DEPTH=1)(*LIMIT_MATCH=1)" and
> then 10000 times the string "x+z?". Even then, appending the "y" at
> the end makes the match run very fast.


You should have tried PCRE2_NO_START_OPTIMIZE:

re> "(*LIMIT_MATCH=10)(x+x+x+x+)+y"no_start_optimize
data> \[x]{100000}

Failed: error -47: match limit exceeded
>
> Why is this so different between jit/non-jit. What surprising
> optimizations are done in the non-jit-case?


Various optimizations are possible, but I don't have personal knowledge
of the JIT code. In the non-JIT case, there are checks for a starting
code unit and a "must be present" code unit, and the subject length is
checked for a minimum length (5 in your example). In multiline mode
there may be a check for starting a match at the start of a line instead
of at a specific code unit.

Philip

--
Philip Hazel