http://bugs.exim.org/show_bug.cgi?id=1554
--- Comment #4 from Vincent Lefevre <vincent@???> 2014-12-20 00:56:31 ---
(In reply to comment #3)
> Yes and no. It does *not* check when it is actually doing the matching.
> Before starting the matching engine, the subject is checked by a
> separate function, implemented as efficiently as possible.
I meant that, after doing the check, there should be a way to do the matching
from the beginning of the subject string to the first invalid byte (excluded),
instead of returning an encoding error. Otherwise grep needs to make two calls
to pcre_exec, which is slower. But see below.
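For illustration, here is a minimal sketch of this two-call scheme with the
PCRE 8.x API (the helper name is mine, not grep's actual code). With a
sufficiently recent 8.x PCRE, when pcre_exec fails with PCRE_ERROR_BADUTF8
and ovecsize is at least 2, the offset of the first invalid byte is stored
in ovector[0], so the match can be rerun on the valid prefix with
PCRE_NO_UTF8_CHECK:

    #include <pcre.h>

    static int match_up_to_invalid_byte(const pcre *re,
                                        const char *subject, int length)
    {
        int ovector[30];
        int rc = pcre_exec(re, NULL, subject, length, 0, 0, ovector, 30);

        if (rc == PCRE_ERROR_BADUTF8) {
            /* ovector[0] holds the offset of the first invalid byte;
               rerun on the valid prefix only, skipping the
               already-performed UTF-8 check. */
            rc = pcre_exec(re, NULL, subject, ovector[0], 0,
                           PCRE_NO_UTF8_CHECK, ovector, 30);
        }
        return rc;
    }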
> > Well, I'm using GNU grep (often recursively), and it is not possible to
> > do this on a file type basis.
>
> Does GNU grep use PCRE? Oh, is it the -P option?
Yes, with the -P option.
> How does it know to use UTF?
With -P, GNU grep assumes that the UTF-8 encoding is used, possibly with
invalid sequences. Currently it calls pcre_exec with the check and, in case of
an encoding error, calls pcre_exec again without the check, from the beginning
of the subject string to the first invalid byte (excluded), then iterates past
the invalid byte. But this seems to be a bad solution:
$ time grep -P zzz file.pdf
grep -P zzz file.pdf 6.64s user 0.04s system 99% cpu 6.692 total
$ time pcregrep zzz file.pdf
pcregrep zzz file.pdf 0.71s user 0.04s system 98% cpu 0.758 total
But note that grep with its own regexp engine wins:
$ time grep zzz file.pdf
grep zzz file.pdf 0.06s user 0.04s system 77% cpu 0.135 total
Similar timings with the pattern 'z[0-9]zzz' (which is not a literal string).
This shows that pcregrep could still be improved in some cases (though this may
not be related to UTF-8 validity, and on some patterns, PCRE is sometimes
faster than the GNU grep regexp engine, as seen below).
> > Actually the problem is more general: on files with very short lines,
> > PCRE can be very slow for about the same reason.
>
> Is this also with GNU grep? Is it just as slow with pcregrep? Can you
> provide an example?
Actually pcregrep is a bit slower than grep -P! On a file with 10,000,000
lines, each containing only "a":
$ time grep zzz file
grep zzz file 0.00s user 0.00s system 55% cpu 0.007 total
$ time grep -P zzz file
grep -P zzz file 0.55s user 0.01s system 96% cpu 0.578 total
$ time pcregrep zzz file
pcregrep zzz file 0.64s user 0.00s system 97% cpu 0.659 total
$ time grep '[0-9]' file
grep '[0-9]' file 3.73s user 0.01s system 99% cpu 3.749 total
$ time grep -P '[0-9]' file
grep -P '[0-9]' file 0.56s user 0.00s system 95% cpu 0.590 total
$ time pcregrep '[0-9]' file
pcregrep '[0-9]' file 0.64s user 0.00s system 97% cpu 0.663 total
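For reference, the kind of loop involved looks like the following sketch
(hypothetical code, not what grep or pcregrep actually contains): one
pcre_exec call per line, so with very short lines the fixed per-call cost
(option handling, validity checks, matching-engine setup) dominates the
actual matching work.

    #include <pcre.h>
    #include <stdio.h>
    #include <string.h>

    static long count_matching_lines(const pcre *re, FILE *fp)
    {
        char line[4096];
        int ovector[30];
        long matches = 0;

        /* One pcre_exec() call per line: for 10,000,000 one-byte lines,
           the per-call overhead alone accounts for most of the runtime. */
        while (fgets(line, sizeof line, fp) != NULL) {
            line[strcspn(line, "\n")] = '\0';
            if (pcre_exec(re, NULL, line, (int)strlen(line), 0, 0,
                          ovector, 30) >= 0)
                matches++;
        }
        return matches;
    }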
> Because I concentrated on producing a fast regex library for non-literal
> regex pattern matching. An application that uses the PCRE library can
> check for a literal string and do something else, of course. It sounds
> as though GNU grep does this, but only when not using PCRE.
Actually, grep is faster than the PCRE library on particular patterns, not just
literal strings, as seen above.
> > Note: with --perl-regexp, GNU grep doesn't know how to parse a PCRE
> > pattern, so that it cannot do this optimization on its side.
>
> Well, that's a GNU grep problem. :-) (Sorry, couldn't resist, but
> checking a Perl pattern for being literal can't be that hard.)
It's also a pcregrep problem, and it is not solved there either! Anyway, even
patterns like 'z[0-9]zzz' are affected, as seen above.
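For what it's worth, such a literal-pattern check could be as simple as the
following sketch (a hypothetical helper, not actual grep or pcregrep code):
a Perl pattern containing none of the usual metacharacters can be matched as
a plain substring, e.g. with memmem(), bypassing the regex engine entirely.

    #include <stdbool.h>
    #include <string.h>

    static bool pattern_is_literal(const char *pattern)
    {
        /* Any of these characters can give a Perl pattern a non-literal
           meaning; their absence makes plain substring search safe. */
        return pattern[strcspn(pattern, "\\^$.|?*+()[]{}")] == '\0';
    }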