------- You are receiving this mail because: -------
You are on the CC list for the bug.
http://bugs.exim.org/show_bug.cgi?id=1027
Summary: Faster keyword matching
Product: PCRE
Version: N/A
Platform: All
URL: http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string
_matching_algorithm
OS/Version: All
Status: NEW
Severity: wishlist
Priority: medium
Component: Code
AssignedTo: ph10@???
ReportedBy: alehotsky@???
CC: pcre-dev@???
Using PCRE to do simple keyword matching, e.g. a pattern that looks like
S1|S2|S3|S4....|Sm
to see if ANY of the S[i]'s are present in the search string could be much
faster.
For comparison, we used an implementation of Aho-Corasick string matching
()
and looked at the performance of PCRE vs AC for various values of 'm'.
Here's what we saw (times are in seconds, and the searched strings are large):
M PCRE AC RE
20 19 2 560
100 244 3 *
200 574 5 *
400 * 8 *
e.g PCRE is behaving like O(M) while AC is O(log M). The 4th column is the old
Henry Spencer regular expression code, so it's quite clear that PCRE is
actually doing quite a good job of optimizing this, but it could be
significantly better
Our 'M' values tend to be large, so the O(log M) boundary can be important.
Certainly if there's a better way to write the patterns other than the simple
alternation form used here, I'd love to know about it.
If I were to tackle the project of adding A-C matching to studied PCRE patterns
would you be willing to accept this code into your distribution?
--
Configure bugmail:
http://bugs.exim.org/userprefs.cgi?tab=email