[pcre-dev] [Bug 1058] New: Feeding bad utf8 strings confuses pcre_exec()

Author: Alan Lehotsky
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 1058] New: Feeding bad utf8 strings confuses pcre_exec()

------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1058
           Summary: Feeding bad utf8 strings confuses pcre_exec()
           Product: PCRE
           Version: 7.9
          Platform: Other
        OS/Version: Windows
            Status: NEW
          Severity: bug
          Priority: low
         Component: Code
        AssignedTo: ph10@???
        ReportedBy: alehotsky@???
                CC: pcre-dev@???, alehotsky@???

If I feed a target string like "+\001\177" to pcre_exec() with a pattern of
"[^a-z]", it appears to treat the trailing NUL on the string as the second
byte of a UTF8 character. It then returns a 'match' of {0,4}.

Now, I admit that this is bad Unicode, and it's not hard for me to check
whether the returned match extends past the end of the source string.
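
The caller-side check described above can be sketched roughly as follows. This
is only an illustration: the helper name offsets_within_subject is
hypothetical and not part of PCRE, and it assumes that rc is the positive
value returned by pcre_exec() and that ovector is the offset vector it filled
in.

```c
/* Hypothetical helper, not part of PCRE.
 * Returns 1 if every offset pair reported by a successful pcre_exec() call
 * lies inside the subject string, 0 otherwise.
 *   ovector        - offset vector filled in by pcre_exec()
 *   rc             - positive return value (number of pairs filled in)
 *   subject_length - length in bytes that was passed to pcre_exec()
 */
int offsets_within_subject(const int *ovector, int rc, int subject_length)
{
    int i;
    for (i = 0; i < rc; i++) {
        int start = ovector[2 * i];
        int end = ovector[2 * i + 1];
        if (start == -1 && end == -1)
            continue;   /* unset capturing group, nothing to check */
        if (start < 0 || end < start || end > subject_length)
            return 0;   /* match claims bytes outside the subject */
    }
    return 1;
}
```

With a 3-byte subject, a reported pair of {0,4} fails this check, while {0,1}
passes it.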

But I suspect this may indicate a more serious bug when partial matching is
in use and a UTF8 character is split across buffers. Since no UTF8 multibyte
sequence can contain a NUL byte, it should be easy to detect this case and
either report an error or return the correct endpoint.
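
One way such detection could look, as a rough sketch rather than anything
taken from the PCRE sources: scan the subject and make sure every lead byte is
followed, inside the buffer, by the number of continuation bytes (0x80..0xBF)
it announces. A NUL or the end of the buffer in the middle of a sequence is
then reported. The function names below are hypothetical, and the check is
structural only (it does not reject overlong encodings or surrogates).

```c
#include <stddef.h>

/* Number of continuation bytes announced by a UTF-8 lead byte,
 * or -1 if the byte cannot start a sequence (hypothetical helper). */
static int utf8_trailing_count(unsigned char lead)
{
    if (lead < 0x80) return 0;            /* single-byte (ASCII) */
    if ((lead & 0xE0) == 0xC0) return 1;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 2;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 3;  /* 11110xxx */
    return -1;                            /* stray continuation or invalid */
}

/* Returns the byte offset of the first malformed or truncated sequence in
 * s[0..len), or -1 if the buffer is structurally well formed. */
long utf8_first_bad_offset(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        int need = utf8_trailing_count(s[i]);
        int k;
        if (need < 0)
            return (long)i;               /* byte cannot start a character */
        if ((size_t)need > len - i - 1)
            return (long)i;               /* sequence runs past the buffer */
        for (k = 1; k <= need; k++) {
            /* Continuation bytes must look like 10xxxxxx; a NUL never does. */
            if ((s[i + k] & 0xC0) != 0x80)
                return (long)i;
        }
        i += (size_t)need + 1;
    }
    return -1;
}
```

A caller could run this over the subject before matching and treat any
non-negative result as an error, or as the point where a character may
continue in the next buffer.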

--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email