https://bugs.exim.org/show_bug.cgi?id=2793
Bug ID: 2793
Summary: Case insensitive search gets exponentially slower with
larger buffers and a specific text file
Product: PCRE
Version: 10.37 (PCRE2)
Hardware: x86-64
OS: All
Status: NEW
Severity: bug
Priority: medium
Component: Code
Assignee: Philip.Hazel@???
Reporter: tempelmann@???
CC: pcre-dev@???
Created attachment 1395
  --> https://bugs.exim.org/attachment.cgi?id=1395&action=edit
main.c, 1.txt, 2.txt
I have two log files. In both, every line is 91 characters long and contains
only ASCII characters. The first repeats the same line throughout. The other
has "real" log lines, with ever-changing time codes. Both are about 10 MB in
size.
When I search the first, it takes milliseconds, but searching the other takes
many seconds, and that's clearly wrong.
If I double the file and buffer sizes, the search time grows far more than
twofold, and only with the second file.
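To put a number on how the search time scales when the buffer doubles, one
option is to time the same search at sizes n and 2n and compare the two
timings. The following is only a minimal sketch under stated assumptions: the
names search_fn, naive_find_edl, time_search and doubling_ratio are
hypothetical, and naive_find_edl is a plain ASCII scan used as a stand-in, not
a PCRE2 call. In a real measurement the callback would wrap the actual match
call.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: runs one search over buf[0..len) and
   returns the match offset, or (size_t)-1 if nothing matched. */
typedef size_t (*search_fn)(const unsigned char *buf, size_t len);

/* Stand-in search (NOT PCRE2): naive ASCII caseless scan for "EDL". */
static size_t naive_find_edl(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i + 3 <= len; i++) {
        if (toupper(buf[i]) == 'E' &&
            toupper(buf[i + 1]) == 'D' &&
            toupper(buf[i + 2]) == 'L')
            return i;
    }
    return (size_t)-1;
}

/* CPU seconds spent in one call of fn over the first len bytes of buf.
   The match offset is stored in *match_ofs when match_ofs is not NULL. */
static double time_search(search_fn fn, const unsigned char *buf,
                          size_t len, size_t *match_ofs)
{
    assert(fn != NULL);
    clock_t start = clock();
    size_t ofs = fn(buf, len);
    clock_t end = clock();
    if (match_ofs != NULL)
        *match_ofs = ofs;
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Ratio of the time at size 2n to the time at size n. A value near 2
   suggests linear growth, near 4 suggests quadratic growth. Returns -1
   when t_n is not positive, because the ratio is then meaningless. */
static double doubling_ratio(double t_n, double t_2n)
{
    if (t_n <= 0.0)
        return -1.0;
    return t_2n / t_n;
}
```

Timing the search at, say, 5 MB, 10 MB and 20 MB and printing
doubling_ratio for each consecutive pair would show whether the growth is
roughly linear, quadratic, or worse.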
Also, this only happens in non-JIT mode, and only when I choose the
case-insensitive option. I also cannot reproduce it with the pcre2grep
command I built, using the options "--buffer-size=32M --no-jit -i". It only
goes wrong with my own code.
Here's the code I use to read and search each file. It's as simple as it can
get, I think.
const char *find = "EDL";
uint32_t regexOptions = PCRE2_CASELESS; // without this, it's fast as expected
int errNum = 0; PCRE2_SIZE errOfs = 0;
pcre2_code *regEx2 = pcre2_compile_8 ((PCRE2_SPTR)find,
    PCRE2_ZERO_TERMINATED, regexOptions, &errNum, &errOfs, NULL);
pcre2_match_data *regEx2Match =
    pcre2_match_data_create_from_pattern (regEx2, NULL);
// read from file
size_t dataLen = 10 * 1024 * 1024; // 10 MB
void *dataPtr = malloc (dataLen);
int fd = open ("2.txt", O_RDONLY);
dataLen = read (fd, dataPtr, dataLen);
pcre2_match_8 (regEx2, (PCRE2_SPTR8)dataPtr, dataLen, 0, 0, regEx2Match,
    NULL);
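A side note on the file-reading step: read() may return fewer bytes than
requested, or -1 on failure, and storing its result directly in a size_t turns
-1 into a very large length. The sketch below shows one way to read with
checks, using POSIX calls only. The helper name read_fully is hypothetical.
This is not offered as an explanation of the slowdown; it only removes one
variable from the reproduction.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to cap bytes from the file at path into
   buf. Returns the number of bytes read, or -1 if open() or read()
   fails. */
static ssize_t read_fully(const char *path, void *buf, size_t cap)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    size_t total = 0;
    while (total < cap) {
        ssize_t n = read(fd, (char *)buf + total, cap - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            close(fd);
            return -1;
        }
        if (n == 0)
            break;                 /* end of file */
        total += (size_t)n;        /* short read: keep going */
    }

    close(fd);
    return (ssize_t)total;
}
```

With such a helper, the length passed to the match call would be the checked
return value, used only after confirming that it is not negative.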
Attached are the complete "main.c" and the two text files, zipped (they
compress quite well, to about 400 KB).