[pcre-dev] Matching file contents as one string using PCRE

Author: ph10
Date:
To: pcre-dev
CC: pcunite
Subject: [pcre-dev] Matching file contents as one string using PCRE_DOTALL

pcunite@??? wrote:

> Goals:
> 1. Load a text file into a std::wstring buffer.
> 2. Have no regard for the concept of "lines". One big string is fine.
> 3. Use Positive Lookahead to find terms in ANY order.
> 4. I don't ever care about capturing or using LookBehinds.
> 5. Any character before or after the query terms is fine.
>
> Regex:
> Thus I have come up with this: (?=.*hello.*)(?=.*world.*)

Some comments about that regex, independent of DOTALL:

1. You shouldn't put .* after the string you want because that just
wastes time scanning the rest of the string (which you know will match).
Each group can simply be (?=.*hello) and so on.

2. By starting each group with .* you in effect search from the end of
the string backwards, because .* swallows the whole string and then has
to back off. If you use .*? instead (a minimizing *) the search is from
the start of the string. Whether this matters depends on the length of
the string and whether you expect to find the terms nearer one end than
the other (on average).
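
Applying points 1 and 2 to the original regex gives
(?=.*?hello)(?=.*?world). Here is a minimal sketch of the comparison,
using Python's re module as a stand-in for PCRE. That is an assumption:
re.DOTALL plays the role of PCRE_DOTALL here, and contains_all is a
hypothetical helper name, not part of any library.

```python
import re

# Original form: greedy .* on both sides of each term.
original = re.compile(r"(?=.*hello.*)(?=.*world.*)", re.DOTALL)

# Revised form: no trailing .* and a minimizing .*? in front,
# so each lookahead scans forward from the start of the subject.
revised = re.compile(r"(?=.*?hello)(?=.*?world)", re.DOTALL)


def contains_all(pattern, subject):
    """Return True if every lookahead in pattern succeeds on subject."""
    # match() tries only at offset 0; that is enough because each
    # lookahead scans the rest of the subject by itself.
    return pattern.match(subject) is not None


# The terms appear in the opposite order and on different lines.
text = "first line: world\nsecond line\nthird line: hello\n"
both_agree = contains_all(original, text) and contains_all(revised, text)
```

Both forms accept and reject the same subjects; they differ only in
how much scanning and backing off the matcher has to do.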

3. One could construct a much more complicated regex that looks for both
terms at once, then continues for the one that it didn't find. This
might be faster for a small number of terms and a long subject string,
but I stress "might" - the extra complication may negate any savings
from not re-scanning. I'm thinking of a pattern like this:

(?=(hello)|world).*?(?(1)world|hello)

The condition (?(1)...) tests whether group 1 is set, that is, whether
the lookahead found "hello" rather than "world". The pattern moves
along the string until it reaches "hello" or "world". A minimizing .*?
followed by the conditional then searches forward for whichever term
was not found. This approach rapidly gets more complicated for more
than 2 strings, and in such a case I would advise writing a program to
generate the regex.
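
As a small illustration of generating such a pattern, the sketch below
builds the two-term case only. It uses Python's re module, which
accepts the same (?(1)yes|no) conditional syntax, as a stand-in for
PCRE; two_term_pattern is a hypothetical helper name, and extending it
to more than two terms is not attempted here.

```python
import re


def two_term_pattern(first, second):
    """Build the conditional pattern for two fixed terms (hypothetical helper)."""
    a, b = re.escape(first), re.escape(second)
    # Group 1 is set only when the lookahead found `first`;
    # the conditional then looks for whichever term is still missing.
    return re.compile(rf"(?=({a})|{b}).*?(?(1){b}|{a})", re.DOTALL)


pattern = two_term_pattern("hello", "world")

# "world" comes first here, so group 1 stays unset and the
# conditional goes on to look for "hello".
found = pattern.search("xx world\nyy hello")

# Only one of the two terms is present, so there is no match.
missing = pattern.search("hello hello")
```

The generated pattern string for these two terms is the same as the one
shown above.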

4. Finally, if you are indeed looking for fixed strings, using a regex
is almost certainly not the fastest way to do it, though it may be the
most convenient. Particularly when searching long strings, there are
specialized fixed-string search algorithms (I seem to recall that
Boyer-Moore was the first of them) that run much faster. Again, for
short strings and/or one-off searches, this may not matter.
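
To give an idea of what such an algorithm looks like, here is a minimal
sketch of the Horspool simplification of Boyer-Moore (not the full
Boyer-Moore algorithm), written in Python for readability. The names
horspool_find and contains_all_terms are hypothetical.

```python
def horspool_find(text, needle):
    """Return the index of the first occurrence of needle in text, or -1."""
    m, n = len(needle), len(text)
    if m == 0:
        return 0
    # For each character of the needle except the last, record how far
    # the window may shift when that character is aligned with the
    # window's final position. Later occurrences overwrite earlier
    # ones, which keeps the smallest (safe) shift.
    shift = {ch: m - 1 - i for i, ch in enumerate(needle[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == needle:
            return pos
        # Characters that do not occur in the needle allow a full-length skip.
        pos += shift.get(text[pos + m - 1], m)
    return -1


def contains_all_terms(text, terms):
    """True if every fixed term occurs somewhere in text, in any order."""
    return all(horspool_find(text, term) != -1 for term in terms)


sample = "first line: world\nsecond line\nthird line: hello\n"
ok = contains_all_terms(sample, ["hello", "world"])
```

In real Python code the built-in str.find or the "in" operator does the
same job without going through the regex engine; the point of the
sketch is only that the search can skip ahead by up to the length of
the term instead of examining every character.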

Philip

--
Philip Hazel
