[pcre-dev] Searching a big file with PCRE

Author: Maël Hörz
Date:
To: pcre-dev
Subject: [pcre-dev] Searching a big file with PCRE

Hello,

I am not sure whether this is the right place to ask; if not, please
direct me to a more appropriate one.

I would like to search big files with PCRE, so it is not practical to
load a whole file into a single buffer. Since the subject parameter of
pcre_exec has to be a buffer, and it doesn't seem to accept files or
streams, the file has to be read in chunks that are passed one at a
time to pcre_exec.

There are a couple of problems with this chunked approach:
1.) The file might contain a match that is split across two chunks.
E.g. if my pattern is "abcd" (and the file contains "abcd" somewhere),
then "ab" might be at the end of the first chunk and "cd" right at the
beginning of the second chunk. Because pcre_exec is passed "ab" and
"cd" separately, it will not detect the whole pattern "abcd" even
though the file contains it.
Stated more generally:
Because of these buffer-boundary issues, not all occurrences of a
pattern in the file are guaranteed to be found.
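
To make problem 1 concrete, here is a minimal sketch. It uses Python's
`re` module as a stand-in for pcre_exec (an assumption made purely for
illustration; this is not PCRE code), and the helper name
`search_chunks` and the sample data are made up. It searches each
fixed-size chunk independently, the way the naive chunked approach
would:

```python
import re

def search_chunks(pattern, data, chunk_size):
    """Search each fixed-size chunk on its own; return (offset, text) pairs."""
    regex = re.compile(pattern)
    found = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        for m in regex.finditer(chunk):
            # Report the offset relative to the whole input, not the chunk.
            found.append((start + m.start(), m.group()))
    return found

data = "xxxxxxabcdxxxxxx"

# One chunk covering everything: the match is found.
whole = search_chunks("abcd", data, len(data))

# Chunks of 8: the boundary falls between "ab" and "cd", so nothing is found.
split = search_chunks("abcd", data, 8)

print(whole)
print(split)
```

With the whole input in one chunk the match is reported; with a chunk
size of 8 the first chunk ends in "ab", the second starts with "cd",
and the search comes back empty.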

2.) If I read a file in chunks, how do I correctly handle expressions
such as "a.*b", so that everything is matched as if it had been read in
one big chunk? For example, one chunk might end with "axxxb" and the
next chunk might be "yyyyb". In this case only "axxxb" would be
matched. But if everything were placed together in one chunk, the
entire string "axxxbyyyyb" would be matched.

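Problem 2 can be sketched the same way. Again Python's `re` module
stands in for pcre_exec (an assumption for illustration only), and the
variable names are made up. The greedy pattern "a.*b" gives a shorter
match when the two chunks are searched separately than when they are
joined:

```python
import re

regex = re.compile("a.*b")

first_chunk = "axxxb"
second_chunk = "yyyyb"

# Each chunk searched on its own: the second chunk has no "a",
# so only the match inside the first chunk is reported.
chunked = [m.group()
           for chunk in (first_chunk, second_chunk)
           for m in regex.finditer(chunk)]

# Both chunks joined: the greedy ".*" runs on to the last "b".
whole = [m.group() for m in regex.finditer(first_chunk + second_chunk)]

print(chunked)
print(whole)
```

The chunked search reports "axxxb", while the joined search reports
the longer "axxxbyyyyb".
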
How can I deal with the problems mentioned above? Is there a good
approach to searching big files with PCRE?

Regards, Maël Hörz.