Autor: ph10 Data: A: Zoltán Herczeg CC: Magnus Rex, pcre-dev@exim.org Assumpte: Re: [pcre-dev] Non-capturing group overhead (100x)
On Tue, 5 Dec 2017, Zoltán Herczeg wrote:
> do you use JIT?
>
> The engine has special single character optimizations to make /.*/ fast.
... and even if you don't use JIT, the interpreting engine can handle .*
quickly. Think about it - the fragment .* can be processed by zipping
along the subject until either the end or a newline is reached. If you
have set the DOTALL option, it's even faster: it can just jump to the
end of the subject, effectively doing no work at all.
By contrast, when a non-capturing parenthesis is reached, a backtracking
point has to be established so that the engine can backtrack if the
first branch of the group fails, so that it can try other branches (it
doesn't know there is only one branch; this is generic code). The
backtrack is also necessary for repeated groups so that it can carry on
after a failure as long as an appropriate number of matches have
happened. A group such as (?:A){1,50} (where A may be complex) might match
45 times, then hit a failure, so it must try again after 44 matches, 43
matches, etc.
In other words, quite a lot of work has to be done to process
parenthesized groups, and currently there is no special optimization
that can turn (?:.)* into .* (that's effectively what it is, but why
would anybody write it that way?).
This kind of thing is discussed at length in Friedl's "Mastering Regular
Expressions" (O'Reilly). In Chapter 6 ("Crafting an efficient
expression") he actually says, under the heading "Don't add superfluous
parentheses", "Unless you need to know the last character matched by .*
don't use (.)*". Anybody who is working on optimizing regex matching who
has not read Friedl's book should rush out and buy a copy immediately
... and then read it very carefully. :-)