[pcre-dev] [Bug 1554] support subject strings with invalid UTF-8 sequences

Author: Philip Hazel
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 1554] support subject strings with invalid UTF-8 sequences


http://bugs.exim.org/show_bug.cgi?id=1554

--- Comment #3 from Philip Hazel <ph10@???> 2014-12-19 10:21:44 ---
On Thu, 18 Dec 2014, Vincent Lefevre wrote:

> > I understand your requirement, but it is very unlikely ever to be implemented
> > because checking each character every time it is loaded would slow down the
> > matching function far too much.
>
> It already does this work when checking UTF-8 validity, doesn't it?

Yes and no. It does *not* check when it is actually doing the matching.
Before starting the matching engine, the subject is checked by a
separate function, implemented as efficiently as possible. This means
that the actual matching engine can assume that it is dealing with a valid
UTF-8 string. It also means that the checking can be disabled by the
PCRE_NO_UTF8_CHECK option when the caller knows that the string is
valid - in particular when matching the same subject string many times.
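
The two-stage structure described above can be sketched in Python. This is
only an illustration of the idea, not PCRE's code: the names check_utf8 and
search_all are hypothetical, and Python's re module stands in for the
matching engine (it has no equivalent of PCRE_NO_UTF8_CHECK).

```python
import re

def check_utf8(subject: bytes) -> str:
    """Separate up-front check: one linear pass over the subject.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    """
    return subject.decode("utf-8", errors="strict")

def search_all(patterns, text: str):
    """Matching stage: takes already-validated text and never re-checks it."""
    return [re.search(p, text) is not None for p in patterns]

# Validate once, then match the same subject as many times as needed.
text = check_utf8("caf\u00e9 au lait".encode("utf-8"))
results = search_all(["caf.", "tea"], text)
```

The cost of validation is paid once per subject, however many patterns are
then run against it.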

Checking while actually doing the matching would not only complicate
(and therefore slow down) the matching engine but would be wasteful,
because PCRE's Perl-compatible matching uses a backtracking algorithm,
and so may inspect the same character in the subject more than once.
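
To make the backtracking point concrete, here is a toy matcher in Python
that counts how often each subject position is read. It is a sketch under
stated assumptions, not PCRE's algorithm: it supports only literal
characters and a greedy "c*", and count_reads is a hypothetical name.

```python
def count_reads(pattern: str, subject: str):
    """Match pattern at the start of subject; return (matched, reads).

    reads[i] is the number of times subject[i] was inspected.
    """
    reads = [0] * len(subject)

    def peek(i):
        reads[i] += 1
        return subject[i]

    def match_here(p, i):
        if p == len(pattern):
            return True
        c = pattern[p]
        if p + 1 < len(pattern) and pattern[p + 1] == "*":
            # Greedy: consume as many c as possible...
            j = i
            while j < len(subject) and peek(j) == c:
                j += 1
            # ...then back off one position at a time until the rest matches.
            while j >= i:
                if match_here(p + 2, j):
                    return True
                j -= 1
            return False
        if i < len(subject) and peek(i) == c:
            return match_here(p + 1, i + 1)
        return False

    return match_here(0, 0), reads

matched, reads = count_reads("a*ab", "aaab")
```

Here the greedy "a*" first swallows all three "a" characters, fails, and
backs off, so the later positions are read more than once. A validity check
placed inside peek() would be repeated on every one of those reads.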

> Well, I'm using GNU grep (often recursively), and it is not possible to do this
> on a file type basis.

Does GNU grep use PCRE? Oh, is it the -P option? How does it know to use
UTF? The pcregrep program requires you to request UTF-8 processing
explicitly.

> Actually the problem is more general: on files with very short lines, PCRE can
> be very slow for about the same reason.

Is this also with GNU grep? Is it just as slow with pcregrep? Can you
provide an example?

> > (Incidentally, if you are just
> > looking for literal strings, there are much faster algorithms than using a
> > regular expression.)
>
> Well, that's a PCRE problem, the goal being to be able to use the same command,
> whether the pattern is a literal string or something more complex. With its own
> regexp support, GNU grep apparently does this optimization (as this can be seen
> with timings). Why not PCRE?

Because I concentrated on producing a fast regex library for non-literal
regex pattern matching. An application that uses the PCRE library can
check for a literal string and do something else, of course. It sounds
as though GNU grep does this, but only when not using PCRE.

> Note: with --perl-regexp, GNU grep doesn't know how to parse a PCRE pattern, so
> that it cannot do this optimization on its side.

Well, that's a GNU grep problem. :-) (Sorry, couldn't resist, but
checking a Perl pattern for being literal can't be that hard.)
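
As a rough illustration of such an application-side check, the Python
sketch below treats a pattern as literal when it contains none of a
conservative set of metacharacters, and falls back to the regex engine
otherwise. The names is_literal and search are hypothetical, and the sketch
assumes no pattern options are in force (in extended mode, for example,
whitespace and "#" would also be significant).

```python
import re

# Characters with special meaning in Perl-style patterns (conservative set).
_META = set("\\^$.|?*+()[]{}")

def is_literal(pattern: str) -> bool:
    """True if the pattern contains no metacharacters at all."""
    return not any(ch in _META for ch in pattern)

def search(pattern: str, text: str) -> int:
    """Return the offset of the first match, or -1 if there is none."""
    if is_literal(pattern):
        # Plain substring search; no regex engine involved.
        return text.find(pattern)
    m = re.search(pattern, text)
    return m.start() if m else -1
```

For example, search("lo w", "hello world") takes the substring path, while
search("w.r", "hello world") goes through the regex engine.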

Anyway, there's no point in my being anything other than truthful: as I
said before, it's highly unlikely that anything like this will appear in
PCRE, at least in its current form, and at least from the current
maintainers. I won't be maintaining it for ever, so the longer term is
of course unknown. I do also still believe that the best way to search
"mixed" 8-bit text is to do it in non-UTF-8 mode, treating it as a byte
string rather than a character string, and looking for specific byte
sequences.
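
A minimal sketch of that byte-oriented approach, using Python's re module
with bytes patterns rather than PCRE itself; the sample data and the
find_bytes helper are made up for illustration.

```python
import re

# Mixed content: a UTF-8 e-acute (C3 A9) and a Latin-1 e-acute (lone E9).
# As a whole, this is not a valid UTF-8 string.
data = b"caf\xc3\xa9 then caf\xe9\n"

def find_bytes(needle: bytes, haystack: bytes):
    """Return the byte offsets at which needle occurs in haystack."""
    return [m.start() for m in re.finditer(re.escape(needle), haystack)]

# Search for specific byte sequences; no character decoding takes place.
utf8_hits = find_bytes("\u00e9".encode("utf-8"), data)
latin1_hits = find_bytes("\u00e9".encode("latin-1"), data)
```

Because the subject is never interpreted as characters, the invalid
sequence cannot cause an error; the trade-off is that the caller has to
spell out each encoding of interest as bytes.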

Philip

