[pcre-dev] [Bug 1027] Faster keyword matching

Autor: Alan Lehotsky
Data:
A: pcre-dev
Assumpte: [pcre-dev] [Bug 1027] Faster keyword matching

------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1027

--- Comment #2 from Alan Lehotsky <alehotsky@???> 2010-10-01 19:27:31 ---
Thanks.

I agree that using RE for this problem is 'overkill' (or at least not
likely to perform optimally). My personal opinion is that O(N) behavior
is pretty darn good - but I have a user who saw the 'sledgehammer' in the
tool shed and pounced on it. Certainly expressing this as a
regular-expression matching problem is simple from the user-perspective of
not having to learn about more complex data-structures and algorithms.

In our case, it is a very big walnut we're trying to crack, and it's
perfectly acceptable to use a direct implementation of A-C matching to get
there.

I also thought of the prefix factoring as a performance win - but of
course in general there is very little commonality in the keywords being
used to search by.

My timing numbers were for pcre_compile() + pcre_exec() and the A-C
numbers included setup of the data-structures as well as lookups.

Just thought this might be an interesting optimization to consider - but I
agree that it's definitely an edge case, which is why I wasn't clamoring
for you to implement this optimization, rather trying more to gauge the
interest in adding this kind of capability to PCRE.

Regards,
Al Lehotsky

Philip Hazel <ph10@???>
Sent by: admin@???
10/01/2010 12:25 PM
Please respond to
1027@???

To
alehotsky@???
cc

Subject
[Bug 1027] Faster keyword matching

------- You are receiving this mail because: -------
You reported the bug.

http://bugs.exim.org/show_bug.cgi?id=1027

--- Comment #1 from Philip Hazel <ph10@???> 2010-10-01
17:25:17 ---
On Fri, 1 Oct 2010, Alan Lehotsky wrote:

> Using PCRE to do simple keyword matching, e.g. a pattern that looks like
>
> S1|S2|S3|S4....|Sm

>
> to see if ANY of the S[i]'s are present in the search string could be
much
> faster.

Using a regular expression matcher to look for simple keywords is like
using a sledgehammer to crack a nut. (It's messy, and it takes a long
time to get what you want.) There are (as you have found) much faster
ways of doing the job. IMHO it's the wrong tool for the job. I suspect
that even a plain straightforward loop that does a basic string match of
S1, S2, etc as you move along the subject string would be faster.

Why are you using something as complicated and general as a regex
matcher to do a simple keyword match?

I knew that PCRE was better than the Spencer code; when I first wrote
it, back in 1998, I did some comparisons to check. I presume you were
using pcre_exec() rather than pcre_dfa_exec() in your tests.

Were you timing the combined used of pcre_compile() plus pcre_exec() or
just pcre_exec()? Likewise, for the A-C algorithm, do the times include
the preparation time?

> Certainly if there's a better way to write the patterns other than the
simple
> alternation form used here, I'd love to know about it.

Except for possibly pulling out common prefixes, I doubt it. Writing
(tweedled(?:um|ee)) might be faster than (tweedledum|tweedledee) I
suspect... hmm... a short test shows that is true, but it rather depends
on the form of the data you are searching. It doesn't benefit much if
there are only a few occurrences of the common prefix that are *not*
followed by matching suffix.

> If I were to tackle the project of adding A-C matching to studied PCRE
> patterns would you be willing to accept this code into your
> distribution?

I am always willing to look at code contributions, but is this really
worth what might be a lot of effort? Most uses of PCRE are not going to
be simple string matches - the whole point of a regex is in the "fancy"
features such as repetition, wild cards, nested subexpressions, etc.

Philip

--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email

Aquest missatge és part del següent fil:
	l'arbre de fils complet ordenat per data
	Philip Hazel en
	Philip Hazel en