[pcre-dev] [Bug 2793] Case insensitive search gets exponenti…

Top Page
Delete this message
Author: admin
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 2793] Case insensitive search gets exponentially slower with larger buffers and a specific text file
https://bugs.exim.org/show_bug.cgi?id=2793

--- Comment #3 from Thomas Tempelmann <tempelmann@???> ---
Oh, re-reading your explanation, I have a suspicion where the problem lies:

You wrote:

> when you specify caseless matching, PCRE2 will try to match EDL (caselessly) at the very many instances of "e" that exist in the file.


and:

> There is a limit on how much text the optimization will search


Here's what probably happens:

1. Fast Scan for "E". Takes long because it's down 5 MB
2. Now it goes back to the start, finds "e" and checks the rest of the search
string.
3. No match. So, it moves forward.

At this point in the loop, instead of going on with step 2, it goes back to
step 1, where it again searches ahead for 5 MB until it runs into the first
"E".

I can think of several remedies:

Change the fast scan to include searching all possible options. In my example,
it has to scan for both "e" and "E". I assume this benefits by using a
specialized CPU instructions that can scan for a byte (because, if not, you'd
simply do a loop where you get one byte and check it then against both "e" and
"E")? So, what you'd need to do is to use that search operation in small
ranges, e.g. over 1000 bytes, looking for "e", and then the same 1000 bytes
looking for "E". If none hits, move forward. But if one hits, the _nearer_ one
is processed (and the farther one's position can be cached so that you won't
need to search again for it until you've moved there).

But currently, instead, I suspect it's searching for "E" and eventually gives
up when it reaches the cut-off point you mentioned. But then repeats the same
long search again and again.

So, the next possible optimization, which may be much easier to implement than
the first suggestion, is to simply cache the point at which the "E" was found
or not, and then not repeat looking for "E" before that point.

Actually, can you tell me where this happens (if you don't have time to look
now, can you give me some pointers where to look)? I like to try the caching
myself, it shouldn't be too hard I hope.

--
You are receiving this mail because:
You are on the CC list for the bug.