Re: [pcre-dev] Non-capturing group overhead (100x)

Top Page
Delete this message
Author: Magnus Rex
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Non-capturing group overhead (100x)
Thanks for the insights!

What I am really doing is having expressions like (simplified)

/(?:.[^x].....)*(..x.....)/

That is, I have a sequence of thousands of fixed length words (8 characters
in the example above) encoding some data structure. I want to search for a
specific word (identified by x at position 3), but as I don't have any word
separator, I need to match and ignore words from the beginning to stay on
the word boundaries (x can appear in any position within the word).

I also need to do this both in the browser (JavaScript) and on the server
(Elixir using Erlang and pcre). The searches take around 1ms in the browser
and 100ms on the server even though I have taken care not to get any major
backtracking. This search is also the current performance bottleneck of the
whole app.

To sum up, it was surprising to me to find such a high extra cost as it is
still more or less a linear search without much backtracking. When I didn't
see the cost at all in JavaScript, I thought that I might have stumbled on
a bug or something.

(I don't know if I benefit from JIT as I am using pcre through the Erlang
libraries.)

Thanks,

/ Magnus

--
Magnus Rex
Mobile: +46 731 80 40 60
https://twitter.com/rexikan

On Tue, Dec 5, 2017 at 10:55 AM, <ph10@???> wrote:

> On Tue, 5 Dec 2017, Zoltán Herczeg wrote:
>
> > do you use JIT?
> >
> > The engine has special single character optimizations to make /.*/ fast.
>
> ... and even if you don't use JIT, the interpreting engine can handle .*
> quickly. Think about it - the fragment .* can be processed by zipping
> along the subject until either the end or a newline is reached. If you
> have set the DOTALL option, it's even faster: it can just jump to the
> end of the subject, effectively doing no work at all.
>
> By contrast, when a non-capturing parenthesis is reached, a backtracking
> point has to be established so that the engine can backtrack if the
> first branch of the group fails, so that it can try other branches (it
> doesn't know there is only one branch; this is generic code). The
> backtrack is also necessary for repeated groups so that it can carry on
> after a failure as long as an appropriate number of matches have
> happened. A group such as (?:A){1,50} (where A may be complex) might match
> 45 times, then hit a failure, so it must try again after 44 matches, 43
> matches, etc.
>
> In other words, quite a lot of work has to be done to process
> parenthesized groups, and currently there is no special optimization
> that can turn (?:.)* into .* (that's effectively what it is, but why
> would anybody write it that way?).
>
> This kind of thing is discussed at length in Friedl's "Mastering Regular
> Expressions" (O'Reilly). In Chapter 6 ("Crafting an efficient
> expression") he actually says, under the heading "Don't add superfluous
> parentheses", "Unless you need to know the last character matched by .*
> don't use (.)*". Anybody who is working on optimizing regex matching who
> has not read Friedl's book should rush out and buy a copy immediately
> ... and then read it very carefully. :-)
>
> Philip
>
> --
> Philip Hazel