[pcre-dev] [Bug 1027] New: Faster keyword matching

Top Page
Delete this message
Author: Alan Lehotsky
Date:  
To: pcre-dev
New-Topics: [pcre-dev] [Bug 1027] Faster keyword matching, [pcre-dev] [Bug 1027] Faster keyword matching, [pcre-dev] [Bug 1027] Faster keyword matching, [pcre-dev] [Bug 1027] Faster keyword matching
Subject: [pcre-dev] [Bug 1027] New: Faster keyword matching
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1027
           Summary: Faster keyword matching
           Product: PCRE
           Version: N/A
          Platform: All
               URL: http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string
                    _matching_algorithm
        OS/Version: All
            Status: NEW
          Severity: wishlist
          Priority: medium
         Component: Code
        AssignedTo: ph10@???
        ReportedBy: alehotsky@???
                CC: pcre-dev@???



Using PCRE to do simple keyword matching, e.g. a pattern that looks like

        S1|S2|S3|S4....|Sm


to see if ANY of the S[i]'s are present in the search string could be much
faster.

For comparison, we used an implementation of Aho-Corasick string matching
()
and looked at the performance of PCRE vs AC for various values of 'm'.

Here's what we saw (times are in seconds, and the searched strings are large):

     M       PCRE        AC     RE
    20         19         2     560
   100        244         3       *
   200        574         5       *
   400          *         8       *


e.g PCRE is behaving like O(M) while AC is O(log M). The 4th column is the old
Henry Spencer regular expression code, so it's quite clear that PCRE is
actually doing quite a good job of optimizing this, but it could be
significantly better

Our 'M' values tend to be large, so the O(log M) boundary can be important.

Certainly if there's a better way to write the patterns other than the simple
alternation form used here, I'd love to know about it.

If I were to tackle the project of adding A-C matching to studied PCRE patterns
would you be willing to accept this code into your distribution?


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email