http://bugs.exim.org/show_bug.cgi?id=530
Philip Hazel <ph10@???> changed:
What       |Removed |Added
----------------------------------------------------------------------------
Status     |NEW     |RESOLVED
Resolution |        |INVALID
--- Comment #4 from Philip Hazel <ph10@???> 2007-08-07 09:58:16 ---
The reason for my delay in responding is, as I said before, that the
SourceForge bug list is not something I knew about, nor is it something I
monitor. Fortunately, somebody has now imported it into the "official" PCRE
Bugzilla list, and so this came to my attention again.
I think the issue here is that I am talking about UTF-8 and you are talking
about UTF-8 as (possibly) restricted by Unicode. I took my definition of UTF-8
from RFC 2279, and as far as I can see, the sequence \xED \xB2 \x94 is
perfectly valid, and encodes the value 0x0000DC94.
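The arithmetic behind that value can be checked by hand. The following is a
minimal sketch in Python, assuming only the three-byte layout
1110xxxx 10xxxxxx 10xxxxxx; the helper name decode_three_byte is hypothetical
and is not part of PCRE.

```python
def decode_three_byte(b0, b1, b2):
    """Decode one three-byte UTF-8 sequence from its bit layout alone.

    Hypothetical helper for illustration; it applies no Unicode rules.
    """
    if (b0 & 0xF0) != 0xE0:
        raise ValueError("lead byte is not of the form 1110xxxx")
    if (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("continuation byte is not of the form 10xxxxxx")
    # 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


value = decode_three_byte(0xED, 0xB2, 0x94)
print(hex(value))  # 0xdc94
```

The lead byte contributes 0xD000, the second byte 0x0C80, and the third 0x14,
which sum to 0xDC94.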
Taking a look at Unicode, I see that this value is in the "low surrogate area",
and there are restrictions on its use. But that is a Unicode restriction, not a
UTF-8 restriction. I am conscious that UTF-8 and Unicode are often thought of
as the same thing, though they are in fact separate.
All PCRE is interested in is ensuring the UTF-8 characters are well-formed, in
the interests of avoiding crashes during its processing. That is why it applies
just the (relatively fast) check for valid UTF-8 and doesn't concern itself
with Unicode interpretations at that stage. I will add some words to the
documentation.
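
To illustrate the distinction, here is a rough sketch in Python of a purely
structural check kept separate from a Unicode-level one. This is not PCRE's
actual code: the function names are hypothetical, the length table assumes the
RFC 2279 layout of up to six bytes, and the sketch deliberately omits any other
checks a real validator might make (overlong forms, for example).

```python
def sequence_length(lead):
    """Length implied by a lead byte, or 0 if it cannot start a sequence."""
    if lead < 0x80:
        return 1
    if lead < 0xC0:
        return 0  # 10xxxxxx is a continuation byte, not a lead byte
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    if lead < 0xF8:
        return 4
    if lead < 0xFC:
        return 5
    if lead < 0xFE:
        return 6
    return 0  # 0xFE and 0xFF never start a sequence


def is_structurally_valid(data):
    """Structural check only: lead bytes followed by enough 10xxxxxx bytes."""
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        if n == 0 or i + n > len(data):
            return False
        for j in range(i + 1, i + n):
            if (data[j] & 0xC0) != 0x80:
                return False
        i += n
    return True


def is_surrogate(code_point):
    """Unicode-level question, independent of the byte encoding."""
    return 0xD800 <= code_point <= 0xDFFF


print(is_structurally_valid(bytes([0xED, 0xB2, 0x94])))  # True
print(is_surrogate(0xDC94))                              # True
```

The sequence under discussion passes the first test and is flagged only by the
second, which is the separation described above.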