Re: [pcre-dev] [Bug 1027] New: Faster keyword matching

Top Page
Delete this message
Author: Philip Hazel
Date:  
To: 1027
CC: pcre-dev
Subject: Re: [pcre-dev] [Bug 1027] New: Faster keyword matching
On Fri, 1 Oct 2010, Alan Lehotsky wrote:

> Using PCRE to do simple keyword matching, e.g. a pattern that looks like
>
>         S1|S2|S3|S4....|Sm

>
> to see if ANY of the S[i]'s are present in the search string could be much
> faster.


Using a regular expression matcher to look for simple keywords is like
using a sledgehammer to crack a nut. (It's messy, and it takes a long
time to get what you want.) There are (as you have found) much faster
ways of doing the job. IMHO it's the wrong tool for the job. I suspect
that even a plain straightforward loop that does a basic string match of
S1, S2, etc as you move along the subject string would be faster.

Why are you using something as complicated and general as a regex
matcher to do a simple keyword match?

I knew that PCRE was better than the Spencer code; when I first wrote
it, back in 1998, I did some comparisons to check. I presume you were
using pcre_exec() rather than pcre_dfa_exec() in your tests.

Were you timing the combined used of pcre_compile() plus pcre_exec() or
just pcre_exec()? Likewise, for the A-C algorithm, do the times include
the preparation time?

> Certainly if there's a better way to write the patterns other than the simple
> alternation form used here, I'd love to know about it.


Except for possibly pulling out common prefixes, I doubt it. Writing
(tweedled(?:um|ee)) might be faster than (tweedledum|tweedledee) I
suspect... hmm... a short test shows that is true, but it rather depends
on the form of the data you are searching. It doesn't benefit much if
there are only a few occurrences of the common prefix that are *not*
followed by matching suffix.

> If I were to tackle the project of adding A-C matching to studied PCRE
> patterns would you be willing to accept this code into your
> distribution?


I am always willing to look at code contributions, but is this really
worth what might be a lot of effort? Most uses of PCRE are not going to
be simple string matches - the whole point of a regex is in the "fancy"
features such as repetition, wild cards, nested subexpressions, etc.

Philip

--
Philip Hazel