On Mon, 27 May 2019, Zoltán Herczeg wrote:
> that is strategical difference. You don't know the input from the
> pattern, and your input has no a-d characters. The interpreter only
> searches 'a', while jit searches two characters: 'a' and 'd' which
> distance is two. The latter is more complicated, but works better for
> random input. You can see the difference here:
The interpreter searches for 'a' using memchr(); if it finds 'a' it then
does a second search for 'd' (again using memchr()) before running the
match. If you set no_start_optimize to disable these optimizations,
there is a huge penalty.
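
To make the two-step search concrete, here is a minimal sketch in C. It
is not the actual PCRE2 source: the function name prescan() and its
parameters are hypothetical, and it assumes an 8-bit subject with a known
first code unit ('a') and a known required later code unit ('d').

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration, not PCRE2 code.
 * Returns a pointer to the first position where a match could start,
 * or NULL if the subject cannot possibly match. */
static const char *prescan(const char *subject, size_t len,
                           char first_cu, char req_cu)
{
    /* Step 1: find the first code unit of the pattern. */
    const char *p = memchr(subject, first_cu, len);
    if (p == NULL)
        return NULL;

    /* Step 2: the required code unit must occur somewhere after it. */
    size_t rest = (size_t)((subject + len) - p) - 1;
    if (memchr(p + 1, req_cu, rest) == NULL)
        return NULL;

    return p;   /* only now is it worth running the real matcher */
}
```

With a subject that contains 'a' but never 'd', step 2 fails on the first
candidate and the matcher is never entered.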
> ./pcre2test -tm
> PCRE2 version 10.34-RC1 2019-04-22
> re> /abcd/
> data> \[012345678a]{2000}
> Match time 0.1659 milliseconds
> No match
> data>
> re> /abcd/jit
> data> \[012345678a]{2000}
> Match time 0.0027 milliseconds
> No match
Thanks for posting that example. I've just noticed that an improvement
may be possible in the interpreter: the search for 'd' happens only if
the subject is quite short, because searching very long strings takes
time, or at least it does when memchr() is not used. memchr() cannot be
used for 16-bit and 32-bit strings, and originally it was not used for
8-bit strings either. I will do some experiments to see whether that
restriction can be lifted for 8-bit strings (and whether doing so
improves performance).
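
As a rough illustration of why the code unit width matters: memchr()
compares single bytes, so it applies directly only to 8-bit subjects,
while wider code units need some other search, such as a plain loop.
The helper names below are hypothetical and this is only a sketch, not
the PCRE2 implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 8-bit subject: memchr() can be used directly. */
static const uint8_t *find_cu8(const uint8_t *s, size_t len, uint8_t c)
{
    return memchr(s, c, len);
}

/* 16-bit subject: memchr() works on bytes, so compare one code unit
 * at a time instead. */
static const uint16_t *find_cu16(const uint16_t *s, size_t len, uint16_t c)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] == c)
            return s + i;
    }
    return NULL;
}
```

How much slower the loop is than memchr() depends on the C library and
the compiler, so any length limit would need to be chosen by measurement.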