[pcre-dev] Non-capturing group overhead (100x)

Góra strony
Delete this message
Autor: Magnus Rex
Data:  
Dla: pcre-dev
Temat: [pcre-dev] Non-capturing group overhead (100x)
Hi,

When optimizing a regular expression in Elixir I was surprised to see
that there is a massive overhead to using non-capture groups. I find
that /(?:.)*/ takes 100 times longer to execute than /.*/. Why is
that, and is there anything I can do to speed it up?

Elixir uses Erlang and PCRE 8.41 for regular expressions.
Investigating further using the Elixir console (see expanation below):


$ iex
Erlang/OTP 20 [erts-9.1.1] [source] [64-bit] [smp:4:4] [ds:4:4:10]
[async-threads:10] [hipe] [kernel-poll:false] [dtrace]

Interactive Elixir (1.5.2) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> s = String.duplicate("a",1000000);nil
nil
iex(2)> time = fn f -> t=:os.system_time(:milli_seconds); f.();
:os.system_time(:milli_seconds)-t end
#Function<6.99386804/1 in :erl_eval.expr/5>
iex(3)> time.(fn -> :re.run(s, ".*") end )
8
iex(4)> time.(fn -> :re.run(s, "(?:.)*") end )
669
iex(5)> time.(fn -> :re.run(s, "(?:..)*") end )
281
iex(6)> time.(fn -> :re.run(s, "(?:....)*") end )
140
iex(7)> time.(fn -> :re.run(s, "(?>.)*") end )
678
iex(8)> time.(fn -> :re.run(s, "(.)*") end )
1558
iex(9)> time.(fn -> :re.run(s, "(..)*") end )
607
iex(10)> time.(fn -> :re.run(s, "(....)*") end )
270
iex(11)> time.(fn -> :re.run(s, ".((?:.)*)") end )
602
iex(12)> time.(fn -> :re.run(s, ".((?:..)*)") end )
285
iex(13)> time.(fn -> :re.run(s, ".((?:....)*)") end )
147


#1 above creates a string with one million characters.
#2 sets up a function to measure execution time in milliseconds.
#3-4 gives that it is 100 times more expensive with a non-capturing group.
#4-6 gives that the execution time is linear to the number of groups captured.
#7 gives that the same is true for atomic groups.
#8-10 shows the extra overhead with actually capturing the groups. It
takes double time compared with non-capturing.
#11-13 shows that it is really the number of groups that set the
execution time and not the length of what is captured.

I have seen the same effect when trying other languages that used
PCRE, but not when using JavaScript of Googles RE2.


I would be really grateful if someone could help me understand this
unexpected result.

/ Magnus

--
Magnus Rex
https://twitter.com/rexikan