Re: [pcre-dev] Non-capturing group overhead (100x)

Top Page
Delete this message
Author: Zoltán Herczeg
Date:  
To: Magnus Rex, pcre-dev@exim.org
Subject: Re: [pcre-dev] Non-capturing group overhead (100x)
Hi,

do you use JIT?

The engine has special single character optimizations to make /.*/ fast. The generic /(?:anything)*/ is much slower.

The numbers below are reasonable. E.g. using capturing bracket is the slowest: /(.)*/, since you need to store extra data.

Yes, you can make pattern matching fast, but the solution depends on what you do. Backtracking engines are similar to algorithmic languages, they provide a lot of tools, but the user need to choose the right one. E.g. a quicksort in python is faster than a bubble sort in C even though C is much faster than python.

Regards,
Zoltan
 
-------- Eredeti levél --------
Feladó: Magnus Rex < magnus@??? (Link -> mailto:magnus@rexikan.se) >
Dátum: 2017 december 4 17:22:20
Tárgy: [pcre-dev] Non-capturing group overhead (100x)
Címzett: pcre-dev@??? (Link -> mailto:pcre-dev@exim.org)
 
Hi,
When optimizing a regular expression in Elixir I was surprised to see
that there is a massive overhead to using non-capture groups. I find
that /(?:.)*/ takes 100 times longer to execute than /.*/. Why is
that, and is there anything I can do to speed it up?
Elixir uses Erlang and PCRE 8.41 for regular expressions.
Investigating further using the Elixir console (see expanation below):
$ iex
Erlang/OTP 20 [erts-9.1.1] [source] [64-bit] [smp:4:4] [ds:4:4:10]
[async-threads:10] [hipe] [kernel-poll:false] [dtrace]
Interactive Elixir (1.5.2) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> s = String.duplicate("a",1000000);nil
nil
iex(2)> time = fn f -> t=:os.system_time(:milli_seconds); f.();
:os.system_time(:milli_seconds)-t end
#Function<6.99386804/1 in :erl_eval.expr/5>
iex(3)> time.(fn -> :re.run(s, ".*") end )
8
iex(4)> time.(fn -> :re.run(s, "(?:.)*") end )
669
iex(5)> time.(fn -> :re.run(s, "(?:..)*") end )
281
iex(6)> time.(fn -> :re.run(s, "(?:....)*") end )
140
iex(7)> time.(fn -> :re.run(s, "(?>.)*") end )
678
iex(8)> time.(fn -> :re.run(s, "(.)*") end )
1558
iex(9)> time.(fn -> :re.run(s, "(..)*") end )
607
iex(10)> time.(fn -> :re.run(s, "(....)*") end )
270
iex(11)> time.(fn -> :re.run(s, ".((?:.)*)") end )
602
iex(12)> time.(fn -> :re.run(s, ".((?:..)*)") end )
285
iex(13)> time.(fn -> :re.run(s, ".((?:....)*)") end )
147
#1 above creates a string with one million characters.
#2 sets up a function to measure execution time in milliseconds.
#3-4 gives that it is 100 times more expensive with a non-capturing group.
#4-6 gives that the execution time is linear to the number of groups captured.
#7 gives that the same is true for atomic groups.
#8-10 shows the extra overhead with actually capturing the groups. It
takes double time compared with non-capturing.
#11-13 shows that it is really the number of groups that set the
execution time and not the length of what is captured.
I have seen the same effect when trying other languages that used
PCRE, but not when using JavaScript of Googles RE2.
I would be really grateful if someone could help me understand this
unexpected result.
/ Magnus
--
Magnus Rex
https://twitter.com/rexikan
--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev