[pcre-dev] [Bug 530] pcre_exec doesn't return PCRE_ERROR_BAD…

Author: Philip Hazel
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 530] pcre_exec doesn't return PCRE_ERROR_BADUTF8 for invalid UTF8
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=530

Philip Hazel <ph10@???> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID





--- Comment #4 from Philip Hazel <ph10@???> 2007-08-07 09:58:16 ---
The reason for my delay in responding is, as I said before, that the
SourceForge bug list is not something I knew about, nor is it something I
monitor. Fortunately, somebody has now imported it into the "official" PCRE
Bugzilla list, and so this came to my attention again.

I think the issue here is that I am talking about UTF-8 and you are talking
about UTF-8 as (possibly) restricted by Unicode. I took my definition of UTF-8
from RFC 2279, and as far as I can see, the sequence \xED \xB2 \x94 is
perfectly valid, and encodes the value 0x0000DC94.
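
For anyone who wants to check the arithmetic, here is a minimal sketch in Python of decoding a three-byte sequence using only the bit layout (1110xxxx 10xxxxxx 10xxxxxx). The function name decode_three_byte is made up for illustration; this is not PCRE code.

```python
def decode_three_byte(b0, b1, b2):
    """Decode one three-byte UTF-8 sequence from its bit layout alone.

    Hypothetical helper for illustration; it applies no Unicode rules.
    """
    # The lead byte must look like 1110xxxx and both continuation
    # bytes must look like 10xxxxxx.
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a well-formed three-byte sequence")
    # 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


value = decode_three_byte(0xED, 0xB2, 0x94)
print(hex(value))  # 0xdc94
```

The payload bits are 1101, 110010 and 010100, which concatenate to 1101 1100 1001 0100, that is, 0xDC94.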

Taking a look at Unicode, I see that this value is in the "low surrogate area",
and there are restrictions on its use. But that is a Unicode restriction, not a
UTF-8 restriction. I am conscious that UTF-8 and Unicode are often thought of
as the same thing, though they are in fact separate.

All PCRE is interested in is ensuring the UTF-8 characters are well-formed, in
the interests of avoiding crashes during its processing. That is why it applies
just the (relatively fast) check for valid UTF-8 and doesn't concern itself
with Unicode interpretations at that stage. I will add some words to the
documentation.
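
To illustrate the distinction, the following Python sketch separates a purely structural check (lead byte followed by the right number of continuation bytes) from a Unicode-level check for surrogates. It is an assumption-laden illustration, not PCRE's actual validity check: the function names are hypothetical, sequences of up to six bytes are accepted as in RFC 2279, and overlong encodings are deliberately not examined.

```python
def sequence_length(lead):
    """Return the sequence length implied by a lead byte, or 0 if invalid."""
    if lead < 0x80:
        return 1                      # plain ASCII
    if lead < 0xC0:
        return 0                      # stray continuation byte
    layouts = ((2, 0xE0, 0xC0), (3, 0xF0, 0xE0), (4, 0xF8, 0xF0),
               (5, 0xFC, 0xF8), (6, 0xFE, 0xFC))
    for length, mask, pattern in layouts:
        if lead & mask == pattern:
            return length
    return 0                          # 0xFE and 0xFF never start a sequence


def code_points(data):
    """Yield decoded values; raise ValueError on a structural error."""
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        if n == 0 or i + n > len(data):
            raise ValueError("bad lead byte or truncated sequence")
        # Keep only the payload bits of the lead byte.
        value = data[i] if n == 1 else data[i] & (0xFF >> (n + 1))
        for b in data[i + 1:i + n]:
            if (b & 0xC0) != 0x80:
                raise ValueError("bad continuation byte")
            value = (value << 6) | (b & 0x3F)
        yield value
        i += n


def is_structurally_valid(data):
    """Structural check only: no Unicode interpretation is applied."""
    try:
        for _ in code_points(data):
            pass
        return True
    except ValueError:
        return False


def contains_surrogate(data):
    """Unicode-level check, separate from the structural one."""
    return any(0xD800 <= cp <= 0xDFFF for cp in code_points(data))


sample = b"\xed\xb2\x94"
print(is_structurally_valid(sample), contains_surrogate(sample))  # True True
```

With the sequence from this bug report, the structural check passes while the separate surrogate check flags it, which is the gap between the two definitions described above.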


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email