[pcre-dev] [Bug 2793] Case insensitive search gets exponenti…

Top Page
Delete this message
Author: admin
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 2793] Case insensitive search gets exponentially slower with larger buffers and a specific text file
https://bugs.exim.org/show_bug.cgi?id=2793

--- Comment #2 from Thomas Tempelmann <tempelmann@???> ---
Philip, thank you for looking into this.

A few thoughts:

As I search case-insensitively, shouldn't the fast scan also have found the
occurances of "e", i.e. treated it the same as "E"? And since "e" appears about
as often in both files, this should then not lead to a performance difference,
right?

In fact, if I search for "eDL" instead, then both searches are fast. That's
probably because then the fast scan is able to see each "e" and then perform a
check on the following chars, and quickly determine that they do not match, and
then continue looking for the next "e".

But with the fast scan looking for "E", the case-sensitivity still somehow
needs to be observed. The first "E" appears only halfway down the file, at
around 5 MB. I wonder if, once it's found, the algorithm "realizes" that it
skipped a lot of data without considering the case-insensitive, i.e. that it
also needs to look for "e" within that skipped data. Indeed, if I add a few
more "E" within the first 5MB of the file, search becomes faster again.

Also, if I search for "xDL", with "x" not appearing in either file, I see that
this fast scan having the desired effect: It gets about 5x faster than when
searching for "eDL".

So there seems to be something wrong with how the fast scan behaves with
case-insensitivity.

Skipping the fast scan with PCRE2_NO_START_OPTIMIZE is an option, but since the
optimization helps SO much with what I need from PCRE, i.e. tell me simply
whether a file contains the search string and nothing more, I'd like to keep
this working.

Also, I think it's important to note that with JIT, this performance issue does
not surface. My guess is that the fast scan is not used with JIT because the
JIT implements code that will do practically the same as the fast scan does,
that is, look for the starting match (here: "e" or "E"), via a state machine?
Whereas, without JIT, the search for the first state by using a table is a
performance pit compared to simply scanning all bytes until you find the "E"?

--
You are receiving this mail because:
You are on the CC list for the bug.