Re: [pcre-dev] Non-capturing group overhead (100x)

Author: ph10
Date:
To: Magnus Rex
CC: pcre-dev
Subject: Re: [pcre-dev] Non-capturing group overhead (100x)

On Wed, 6 Dec 2017, Magnus Rex wrote:

> What I am really doing is having expressions like (simplified)
>
> /(?:.[^x].....)*(..x.....)/
>
> That is, I have a sequence of thousands of fixed length words (8 characters
> in the example above) encoding some data structure. I want to search for a
> specific word (identified by x at position 3), but as I don't have any word
> separator, I need to match and ignore words from the beginning to stay on
> the word boundaries (x can appear in any position within the word).

It's always interesting playing around with patterns that are used in
the real world. I tried your pattern on a line of length 808 characters,
which was 12345678 repeated 101 times, except that the second last had
an x in position 3. (In other words, it had to skip 99 sequences before
finding a match.) I timed the match using pcre2test, that is, I tested
with PCRE2, not PCRE1. (But then I tried a few tests with PCRE1 and got
similar figures.)

For your basic pattern the match time I got was around 0.43 (never mind
the units). Then I added (?s) at the start of your pattern - this sets
the PCRE2_DOTALL flag, which means that the interpreter does not have to
check for a newline every time it matches a dot metacharacter. The time
went down to 0.35 - quite a substantial reduction.

Then I rewrote your pattern, just to see how this performed:

/(?s)(?(?=..x)(..x.....)(*ACCEPT)|.{8})*/

The time went down to 0.08! All this using the interpreter, not JIT.
One of the reasons for this is that .{8} in a non-UTF environment, with
DOTALL enabled, can just increment the current position by 8 code units,
without checking anything. However, even using ........ only puts the
time up to 0.095 (it has to add 1 eight times).

This will all be slower in UTF mode because of the possibility of
characters occupying more than one code unit. The exercise nicely
illustrates Friedl's point that crafting one's patterns carefully and
testing - because different engines have different strategies - can pay
dividends if you are seeking the best performance.

Explanation of my pattern:

(?s)        Set the DOTALL flag so . can match newlines, avoiding the 
            need to test for them.

(?(?=..x)   Start of a conditional group, where the condition is a 
            lookahead to see if there is x three characters ahead.
            (So it only has to process 3 characters before deciding 
            whether to skip or match.)

(..x.....)(*ACCEPT)  If x is found, match the word into group 1, then
                     end the match successfully.

|.{8})*     If not, skip 8 characters and repeat the conditional group.

I doubt whether this is the most efficient - I would not be surprised if
it could be further optimized.

Philip

--
Philip Hazel

This message is part of the following thread:
	the complete thread tree sorted by date
	ph10 at