[pcre-dev] [Bug 1058] Feeding bad utf8 strings confuses pcre

Author: Philip Hazel
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 1058] Feeding bad utf8 strings confuses pcre_exec()

------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1058

--- Comment #1 from Philip Hazel <ph10@???> 2011-01-06 16:19:01 ---
On Wed, 5 Jan 2011, Alan Lehotsky wrote:

> If I feed a target string like "+\001\177" to pcre_exec() with a pattern of
> "[^a-z]", it appears that it grabs the trailing NUL on the string as being the
> second byte of a UTF8 character. It then returns a 'match' of {0,4}.
>
> But I suspect that this might be indicative of a more serious bug if you are
> using partial matches and the utf8 code point breaks across buffers.

I see you were using PCRE 7.9. Please re-check with the latest release,
8.11, which has the following change:

16. Added PCRE_ERROR_SHORTUTF8 to make it possible to distinguish
between a bad UTF-8 sequence and one that is incomplete when using
PCRE_PARTIAL_HARD.
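
To illustrate the distinction that change draws, here is a minimal sketch in
C of telling a malformed UTF-8 sequence apart from one that is merely cut off
at the end of the buffer. It is not PCRE's code: the names classify_utf8,
UTF8_OK, UTF8_BAD and UTF8_SHORT are hypothetical and only loosely mirror
PCRE_ERROR_BADUTF8 and PCRE_ERROR_SHORTUTF8. It assumes a simplified check
(lead byte and continuation bytes only) and does not reject every overlong
form or surrogate.

```c
#include <stddef.h>

/* Hypothetical result codes for this sketch; they are not PCRE's constants. */
enum utf8_status { UTF8_OK = 0, UTF8_BAD = -1, UTF8_SHORT = -2 };

/* Classify a byte buffer: valid, malformed, or ending inside a character.
 * Simplified: only lead bytes and continuation bytes are checked. */
enum utf8_status classify_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;   /* number of continuation bytes expected */

        if (c < 0x80)                    extra = 0;  /* ASCII */
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;  /* 2-byte lead */
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;  /* 3-byte lead */
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;  /* 4-byte lead */
        else return UTF8_BAD;  /* stray continuation or invalid lead byte */

        i++;
        for (size_t k = 0; k < extra; k++, i++) {
            if (i == len)
                return UTF8_SHORT;  /* buffer ended in mid-character */
            if ((s[i] & 0xC0) != 0x80)
                return UTF8_BAD;    /* not a 10xxxxxx continuation byte */
        }
    }
    return UTF8_OK;
}
```

With this sketch, a buffer holding "a" followed by the single byte 0xC3
classifies as UTF8_SHORT, because more data could still complete the
character, while "a", 0xC3, "(" classifies as UTF8_BAD. A caller matching
across buffers would presumably wait for more input in the first case and
reject the data in the second.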

There have also been other UTF-8 changes since 7.9.

Philip

--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email
