On Mon, 15 Aug 2011, 徐?z wrote:
> Case 1:
> respectively match data source with the following 4 regular expressions.
>
> \btype\b\W*?\btext\b\W*?\bjavascript\b
> \burl\b\W*?\bshell:
> <input\b.*?\btype\b\W*?\bimage\b
> \bonkeyup\b\W*?\=
>
> Case 2:
>
> merge the 4 rules above to one rule as followed with '|' and match
> the same data source as case one.
>
> \btype\b\W*?\btext\b\W*?\bjavascript\b|\burl\b\W*?\bshell:|<input\b.*?\btype\b\W*?\bimage\b|\bonkeyup\b\W*?\=
>
> The following is my test results.
>
> For case 1, cost about 2.3 seconds to do 100000 matching.
> For case 2, cost about 24 seconds to do the same thing as case one.
>
> The test results really surprised me. I had thought the case 2 would
> have better performance. But on the contrary, the case 2 has much
> lower performance compared with case 1. I don't know why. Could you
> please explain the reason? Are there any problems with my test case?
> The attached are my test code and data.
I ran pcretest on your 5 patterns to check on the optimizations. This is
what I got:
$ pcretest -i zzz
PCRE version 8.12 2011-01-15
/\btype\b\W*?\btext\b\W*?\bjavascript\b/
Capturing subpattern count = 0
No options
First char = 't'
Need char = 't'
/\burl\b\W*?\bshell:/
Capturing subpattern count = 0
No options
First char = 'u'
Need char = ':'
/<input\b.*?\btype\b\W*?\bimage\b/
Capturing subpattern count = 0
No options
First char = '<'
Need char = 'e'
/\bonkeyup\b\W*?\=/
Capturing subpattern count = 0
No options
First char = 'o'
Need char = '='
/\btype\b\W*?\btext\b\W*?\bjavascript\b|\burl\b\W*?\bshell:|<input\b.*?\
btype\b\W*?\bimage\b|\bonkeyup\b\W*?\=/
Capturing subpattern count = 0
No options
No first char
No need char
Notice that the individual patterns all have a value for "First char".
That means, pcre_exec will skip along the subject to that character
before starting a match. The combined pattern does not have this, so it
will try the match at EVERY character.
I tried studying the combined pattern, assuming it would find a starting
byte set of < o t and u (which is what studying does), but it didn't.
This I think is a bug in pcre_study(), which I will investigate (just in
time for the forthcoming 8.13 release). Looks like it is the initial \b
in three patterns that is foxing it, because when I remove that, it does
find the starting bytes.
Philip
--
Philip Hazel