Re: [pcre-dev] A performance question about PCRE library

Inizio della pagina
Delete this message
Autore: Philip Hazel
Data:  
To: 徐?z
CC: pcre-dev
Oggetto: Re: [pcre-dev] A performance question about PCRE library
On Mon, 15 Aug 2011, 徐?z wrote:

> Case 1:
>     respectively match data source with the following 4 regular expressions.

>
>     \btype\b\W*?\btext\b\W*?\bjavascript\b
>     \burl\b\W*?\bshell:
>     <input\b.*?\btype\b\W*?\bimage\b
>     \bonkeyup\b\W*?\=

>
> Case 2:
>
>     merge the 4 rules above to one rule as followed with '|' and match
> the same data source as case one.

>
>     \btype\b\W*?\btext\b\W*?\bjavascript\b|\burl\b\W*?\bshell:|<input\b.*?\btype\b\W*?\bimage\b|\bonkeyup\b\W*?\=

>
> The following is my test results.
>
>     For case 1, cost about 2.3 seconds to do 100000 matching.
>     For case 2, cost about 24 seconds to do the same thing as case one.

>
> The test results really surprised me. I had thought the case 2 would
> have better performance. But on the contrary, the case 2 has much
> lower performance compared with case 1. I don't know why. Could you
> please explain the reason? Are there any problems with my test case?
> The attached are my test code and data.


I ran pcretest on your 5 patterns to check on the optimizations. This is
what I got:

$ pcretest -i zzz
PCRE version 8.12 2011-01-15

/\btype\b\W*?\btext\b\W*?\bjavascript\b/
Capturing subpattern count = 0
No options
First char = 't'
Need char = 't'

/\burl\b\W*?\bshell:/
Capturing subpattern count = 0
No options
First char = 'u'
Need char = ':'

/<input\b.*?\btype\b\W*?\bimage\b/
Capturing subpattern count = 0
No options
First char = '<'
Need char = 'e'

/\bonkeyup\b\W*?\=/
Capturing subpattern count = 0
No options
First char = 'o'
Need char = '='

/\btype\b\W*?\btext\b\W*?\bjavascript\b|\burl\b\W*?\bshell:|<input\b.*?\
btype\b\W*?\bimage\b|\bonkeyup\b\W*?\=/
Capturing subpattern count = 0
No options
No first char
No need char

Notice that the individual patterns all have a value for "First char".
That means, pcre_exec will skip along the subject to that character
before starting a match. The combined pattern does not have this, so it
will try the match at EVERY character.

I tried studying the combined pattern, assuming it would find a starting
byte set of < o t and u (which is what studying does), but it didn't.
This I think is a bug in pcre_study(), which I will investigate (just in
time for the forthcoming 8.13 release). Looks like it is the initial \b
in three patterns that is foxing it, because when I remove that, it does
find the starting bytes.

Philip

--
Philip Hazel