[pcre-dev] [Bug 1554] support subject strings with invalid UTF-8 sequences

Author: Philip Hazel
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 1554] support subject strings with invalid UTF-8 sequences

------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1554

--- Comment #1 from Philip Hazel <ph10@???> 2014-11-30 13:26:15 ---
I understand your requirement, but it is very unlikely ever to be implemented
because checking each character every time it is loaded would slow down the
matching function far too much.

However, I am sure you could do better than calling pcre_exec multiple times.
As these are binary files, it might be better to scan them *without* setting
PCRE_UTF - in other words, to scan them as one-byte characters, if your
patterns are suitable. Literal UTF-8 characters above U+007F can be encoded in your
pattern using sequences such as \xe1\x88\xb4 (for character U+1234) or included
as literal binary bytes in your pattern. However, this works only if you are
not relying on the '.' metacharacter matching a whole UTF-8 character instead
of just one byte, and there may be other similar issues. But it would work if
your patterns are relatively straightforward. (Incidentally, if you are just
looking for literal strings, there are much faster algorithms than using a
regular expression.)
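
To illustrate the byte-escape suggestion above, here is a minimal sketch of
how the \xhh form of a code point could be generated for use in a non-UTF
pattern. The helper name utf8_escape is hypothetical and is not part of
PCRE; it only performs standard UTF-8 encoding and prints each resulting
byte as a \xhh escape.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): write the UTF-8 bytes of code
   point cp into out as a string of \xhh escapes. Returns the number of
   characters written, excluding the terminating NUL, or 0 if cp cannot
   be encoded or the buffer is too small. */
size_t utf8_escape(unsigned long cp, char *out, size_t outsize)
{
    unsigned char b[4];
    size_t n, i;

    if (cp < 0x80) {                        /* one byte: 0xxxxxxx */
        b[0] = (unsigned char)cp;
        n = 1;
    } else if (cp < 0x800) {                /* two bytes: 110xxxxx 10xxxxxx */
        b[0] = (unsigned char)(0xC0 | (cp >> 6));
        b[1] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {              /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        b[0] = (unsigned char)(0xE0 | (cp >> 12));
        b[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        b[2] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFF) {            /* four bytes */
        b[0] = (unsigned char)(0xF0 | (cp >> 18));
        b[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        b[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        b[3] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 4;
    } else {
        return 0;                           /* beyond the Unicode range */
    }

    if (outsize < n * 4 + 1)                /* 4 chars per byte, plus NUL */
        return 0;
    for (i = 0; i < n; i++)
        sprintf(out + i * 4, "\\x%02x", (unsigned)b[i]);
    return n * 4;
}
```

Calling utf8_escape(0x1234, buf, sizeof buf) with a buffer of at least 13
bytes fills buf with \xe1\x88\xb4, the sequence quoted above. The resulting
text can be spliced into a pattern that is compiled without PCRE_UTF.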

Creating a separate version of PCRE with this checking would be possible, but
the behaviour of illegal sequences as "non-matching" needs more definition.
Would they match or not match items like [^a]? And I think there might be
issues in defining the length of an illegal sequence. So I really don't think
this is a good idea.

--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email
