On Fri, 21 Feb 2020, Kilian Kilger via Pcre-dev wrote:
> Matching with jit, it was very easy to produce an example which
> exceeds the available resources: We take the pattern
> "(*LIMIT_MATCH=10)(x+x+x+x+)+y" and as subject we take a string of
> length 100000 containing only the letter "x".
>
> It was however surprisingly difficult to get the normal non-jit
> matcher pcre2_match to run out of resources. With the "y" at the end
> of the pattern, we never got any resource exhaustion in the non-jit
> case.
That's because of this optimization:
$ ./pcre2test -i
PCRE2 version 10.35-RC1 2019-11-27
re> "(*LIMIT_MATCH=10)(x+x+x+x+)+y"
Capture group count = 1
Match limit = 10
First code unit = 'x'
Last code unit = 'y' <=================================
Subject length lower bound = 5
The subject will be searched for 'y' before doing anything else.
However, no more that 5000000 bytes are searched.
re> "(*LIMIT_MATCH=10)(x+x+x+x+)+y"
data> \[x]{1000000}
No match
data> \[x]{6000000}
Failed: error -47: match limit exceeded
JIT must be doing something different. Perhaps Zoltán will answer.
> At the end we only succeeded by taking a very long pattern which
> consisted of "(*LIMIT_HEAP=1)(*LIMIT_DEPTH=1)(*LIMIT_MATCH=1)" and
> then 10000 times the string "x+z?". Even then, appending the "y" at
> the end makes the match run very fast.
You should have tried PCRE2_NO_START_OPTIMIZE:
re> "(*LIMIT_MATCH=10)(x+x+x+x+)+y"no_start_optimize
data> \[x]{100000}
Failed: error -47: match limit exceeded
>
> Why is this so different between jit/non-jit. What surprising
> optimizations are done in the non-jit-case?
Various optimizations are possible, but I don't have personal knowledge
of the JIT code. In the non-JIT case, there are checks for a starting
code unit and a "must be present" code unit, and the subject length is
checked for a minimum length (5 in your example). In multiline mode
there may be a check for starting a match at the start of a line instead
of at a specific code unit.
Philip
--
Philip Hazel