[pcre-dev] [Bug 1554] support subject strings with invalid UTF-8 sequences

Author: Vincent Lefevre
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 1554] support subject strings with invalid UTF-8 sequences

http://bugs.exim.org/show_bug.cgi?id=1554

--- Comment #4 from Vincent Lefevre <vincent@???> 2014-12-20 00:56:31 ---
(In reply to comment #3)
> Yes and no. It does *not* check when it is actually doing the matching.
> Before starting the matching engine, the subject is checked by a
> separate function, implemented as efficiently as possible.

I meant that after doing the check, there should be a way to do the matching
from the beginning of the subject string up to (but not including) the first
invalid byte, instead of returning an encoding error. Otherwise grep needs two
calls to pcre_exec, which is slower. But see below.
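
To make the request concrete, here is a minimal sketch of "match only the valid
prefix". It is written in Python, with the standard re module standing in for
PCRE and bytes.decode standing in for the UTF-8 check; neither is what PCRE or
grep actually does internally. The function names valid_utf8_prefix and
search_valid_prefix are hypothetical.

```python
import re


def valid_utf8_prefix(data: bytes) -> bytes:
    """Return the part of data that precedes the first invalid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return data  # the whole subject is valid
    except UnicodeDecodeError as err:
        # err.start is the offset where the invalid sequence begins.
        return data[:err.start]


def search_valid_prefix(pattern: str, data: bytes):
    """Validate once, then match only against the valid prefix."""
    prefix = valid_utf8_prefix(data).decode("utf-8")
    return re.search(pattern, prefix)
```

For the subject b"abc\xffzzz", the valid prefix is b"abc", so a search for
"zzz" finds nothing in that prefix, while a search for "b" succeeds. The point
is that the validity check and the match are done in a single request, without
a second call after an encoding error.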

> > Well, I'm using GNU grep (often recursively), and it is not possible to
> > do this on a file type basis.
>
> Does GNU grep use PCRE? Oh, is it the -P option?

Yes, with the -P option.

> How does it know to use UTF?

With -P, GNU grep assumes that the UTF-8 encoding is used, possibly with
invalid sequences. Currently it calls pcre_exec with the check and, in case of
an encoding error, calls pcre_exec again without the check on the part of the
subject string from the beginning up to (but not including) the first invalid
byte, and then repeats the process. But this seems to be a bad solution:

$ time grep -P zzz file.pdf
grep -P zzz file.pdf 6.64s user 0.04s system 99% cpu 6.692 total
$ time pcregrep zzz file.pdf
pcregrep zzz file.pdf 0.71s user 0.04s system 98% cpu 0.758 total

But note that grep with its own regexp engine wins:

$ time grep zzz file.pdf
grep zzz file.pdf 0.06s user 0.04s system 77% cpu 0.135 total

The timings are similar with the pattern 'z[0-9]zzz' (which is not a literal
string).
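
The retry strategy described above for grep -P can be sketched roughly as
follows. This is only a model: Python's re module stands in for pcre_exec,
bytes.decode stands in for the UTF-8 check, and it is an assumption that
matching resumes right after the invalid byte. The name search_with_retries is
hypothetical, and the counter only shows how many engine calls such a loop
makes, not where the time measured above actually goes.

```python
import re


def search_with_retries(rx, data: bytes):
    """Return (found, calls) for a checked-then-unchecked retry loop."""
    pos = 0
    calls = 0
    while True:
        calls += 1  # models the call made with the UTF-8 check enabled
        try:
            text = data[pos:].decode("utf-8")
        except UnicodeDecodeError as err:
            bad = pos + err.start  # offset of the first invalid byte
            calls += 1  # models the second call, made without the check
            if rx.search(data[pos:bad].decode("utf-8")):
                return True, calls
            pos = bad + 1  # assumption: skip the invalid byte and go on
            continue
        return rx.search(text) is not None, calls
```

With this model, a fully valid subject costs one call, and every invalid byte
met before a match adds two more. Binary data such as a PDF file can contain
many such bytes.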

This shows that pcregrep could still be improved in some cases (though this may
not be related to UTF-8 validity, and for some patterns PCRE is faster than
the GNU grep regexp engine, as seen below).

> > Actually the problem is more general: on files with very short lines,
> > PCRE can be very slow for about the same reason.
>
> Is this also with GNU grep? It is just as slow with pcregrep? Can you
> provide an example?

Actually pcregrep is a bit slower than grep -P! On a file with 10,000,000 lines
containing only "a":

$ time grep zzz file
grep zzz file 0.00s user 0.00s system 55% cpu 0.007 total
$ time grep -P zzz file
grep -P zzz file 0.55s user 0.01s system 96% cpu 0.578 total
$ time pcregrep zzz file
pcregrep zzz file 0.64s user 0.00s system 97% cpu 0.659 total

$ time grep '[0-9]' file
grep '[0-9]' file 3.73s user 0.01s system 99% cpu 3.749 total
$ time grep -P '[0-9]' file
grep -P '[0-9]' file 0.56s user 0.00s system 95% cpu 0.590 total
$ time pcregrep '[0-9]' file
pcregrep '[0-9]' file 0.64s user 0.00s system 97% cpu 0.663 total
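
For anyone who wants to build a comparable input, the following is a small
sketch that writes a file of very short lines and times a line-by-line search
over it. It uses Python's re module, not PCRE or grep, so its timings are not
comparable with the figures above; the helper names are hypothetical.

```python
import os
import re
import tempfile
import time


def write_short_lines(path, count, line=b"a\n"):
    """Write `count` copies of `line` to `path`, in chunks to limit memory."""
    chunk_lines = 100_000
    full, rest = divmod(count, chunk_lines)
    with open(path, "wb") as f:
        for _ in range(full):
            f.write(line * chunk_lines)
        f.write(line * rest)


def count_matching_lines(path, pattern: bytes):
    """Count the lines of the file at `path` that match `pattern`."""
    rx = re.compile(pattern)
    with open(path, "rb") as f:
        return sum(1 for ln in f if rx.search(ln))


def demo(count=10_000_000):
    """Create the file, search it for a digit, return (hits, seconds)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "file")
        write_short_lines(path, count)
        start = time.perf_counter()
        hits = count_matching_lines(path, rb"[0-9]")
        return hits, time.perf_counter() - start
```

Calling demo() with the default count reproduces the shape of the test file
used above (10,000,000 lines containing only "a"); the per-line overhead of the
matching call then dominates, since no line can match.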

> Because I concentrated on producing a fast regex library for non-literal
> regex pattern matching. An application that uses the PCRE library can
> check for a literal string and do something else, of course. It sounds
> as though GNU grep does this, but only when not using PCRE.

Actually, grep is faster than the PCRE library on particular patterns, not just
literal strings, as seen above.

> > Note: with --perl-regexp, GNU grep doesn't know how to parse a PCRE
> > pattern, so that it cannot do this optimization on its side.
>
> Well, that's a GNU grep problem. :-) (Sorry, couldn't resist, but
> checking a Perl pattern for being literal can't be that hard.)

It's also a pcregrep problem, and it is not solved there either! Anyway, even
patterns like 'z[0-9]zzz' are affected, as seen above.
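
As an illustration of the literal check mentioned in the quoted text, here is a
deliberately conservative sketch: it treats a pattern as literal only if it
contains no character that can be special in a Perl-style pattern. It assumes
no option such as extended mode is in effect, and the name is_plain_literal is
hypothetical; this is not how grep or pcregrep is implemented.

```python
# Characters that may have a special meaning in a Perl-style pattern.
# "]" and "}" are included to stay on the safe side.
PCRE_META = set("\\^$.[]|()?*+{}")


def is_plain_literal(pattern: str) -> bool:
    """Return True if the pattern contains no metacharacter at all."""
    return not any(ch in PCRE_META for ch in pattern)
```

Such a test accepts 'zzz' but rejects 'z[0-9]zzz', so it would not help with
the second kind of pattern timed above.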
