[pcre-dev] [Bug 1554] support subject strings with invalid U…

Author: Vincent Lefevre
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1554] support subject strings with invalid UTF-8 sequences
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1554




--- Comment #2 from Vincent Lefevre <vincent@???> 2014-12-18 13:28:35 ---
(In reply to comment #1)
> I understand your requirement, but it is very unlikely ever to be implemented
> because checking each character every time it is loaded would slow down the
> matching function far too much.


PCRE already does this work when it checks UTF-8 validity, doesn't it? So,
instead of returning an encoding error, how about matching on the first chunk,
i.e. from the first character up to the last one before the invalid UTF-8
sequence? The caller could then repeat the match on the remaining chunks.
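
To make the chunk idea concrete, here is a minimal sketch in Python rather
than C. It does not use PCRE at all: Python's own UTF-8 decoder and `re`
module stand in for the validity check and the matcher, and the function
names valid_utf8_chunks and search_chunks are hypothetical. Where an invalid
sequence starts and ends is whatever Python's decoder reports, which is only
one possible definition.

```python
import re


def valid_utf8_chunks(data: bytes):
    """Yield (byte_offset, text) for each run of valid UTF-8 in data.

    Assumption: the extent of an invalid sequence is the one reported
    by Python's decoder (UnicodeDecodeError.start / .end).
    """
    pos = 0
    while pos < len(data):
        rest = data[pos:]  # copying slice: simple, not tuned for speed
        try:
            yield pos, rest.decode("utf-8")
            return
        except UnicodeDecodeError as err:
            if err.start > 0:
                # Everything before the error is a valid chunk.
                yield pos, rest[:err.start].decode("utf-8")
            # Resume just after the invalid bytes (always make progress).
            pos += max(err.end, err.start + 1)


def search_chunks(pattern: str, data: bytes):
    """Return (chunk_byte_offset, matched_text) for the first match, or None.

    A match can never span an invalid sequence, because each chunk is
    searched on its own.
    """
    rx = re.compile(pattern)
    for offset, text in valid_utf8_chunks(data):
        m = rx.search(text)
        if m:
            return offset, m.group(0)
    return None


if __name__ == "__main__":
    subject = b"abc\xffd\xc3\xa9f"  # 0xff can never appear in UTF-8
    print(search_chunks(r"d.f", subject))  # match inside the second chunk
    print(search_chunks(r"c.d", subject))  # None: would span the bad byte
```

With this definition an invalid byte is never matched by anything, not even
by a class such as [^a]; it only acts as a boundary between chunks.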

> However, I am sure you could do better than calling pcre_exec multiple times.
> As these are binary files, it might be better to scan them *without* setting
> PCRE_UTF [...]


Well, I'm using GNU grep (often recursively), and it is not possible to do this
on a per-file-type basis.

Actually the problem is more general: on files with very short lines, PCRE can
be very slow for about the same reason.

> (Incidentally, if you are just
> looking for literal strings, there are much faster algorithms than using a
> regular expression.)


Well, that's a PCRE problem: the goal is to be able to use the same command
whether the pattern is a literal string or something more complex. With its own
regexp support, GNU grep apparently does this optimization (as can be seen
from timings). Why not PCRE?

Note: with --perl-regexp, GNU grep doesn't know how to parse a PCRE pattern, so
it cannot do this optimization on its side.
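
As an illustration of the kind of literal-string fast path meant here, the
following Python sketch picks plain substring search when the pattern contains
no metacharacter and falls back to the regex engine otherwise. It is not how
GNU grep or PCRE implement anything; the names METACHARS, is_literal and find
are hypothetical, and the metacharacter set is a deliberately conservative
assumption (it may treat some literal patterns as regexes, never the reverse).

```python
import re

# Conservative assumption: any of these characters makes the pattern
# "not a literal", even where the engine would treat it literally.
METACHARS = set("\\^$.[]|()?*+{}")


def is_literal(pattern: str) -> bool:
    """True if the pattern contains no regex metacharacter."""
    return not any(ch in METACHARS for ch in pattern)


def find(pattern: str, text: str):
    """Return the (start, end) span of the first match, or None."""
    if is_literal(pattern):
        # Fast path: plain substring search, no regex engine involved.
        start = text.find(pattern)
        return None if start < 0 else (start, start + len(pattern))
    # General path: hand the pattern to the regex engine.
    m = re.search(pattern, text)
    return m.span() if m else None


if __name__ == "__main__":
    print(find("foo", "a foo b"))  # literal path
    print(find("fo+", "a foo b"))  # regex path
```

The caller uses the same function in both cases, which is the point: the
choice of algorithm is made behind the interface, not by the user.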

> Creating a separate version of PCRE with this checking would be possible, but
> the behaviour of illegal sequences as "non-matching" needs more definition.
> Would they match or not match items like [^a]? And I think there might be
> issues in defining the length of an illegal sequence. So I really don't think
> this is a good idea.


The current behavior is that they never match because an encoding error is
returned. So the closest definition would be that matching is done on chunks
of valid characters between invalid sequences.


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email