http://bugs.exim.org/show_bug.cgi?id=530
--- Comment #6 from Craig Silverstein <csilvers@???> 2007-08-07 17:40:06 ---
While I'm no utf-8 wizard, the standard does seem to support Vincent's
position. The controlling doc for the utf-8 definition appears to be
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G7404
On page 25 of the pdf file (listed as page 78 in the chapter), there's a table
of well-formed utf-8, and the table says that if the first byte is ED, the
second byte has to be in the range 80..9F, not 80..BF like every other
byte-sequence. There are three other, similar exceptions: if the first byte is
E0, the second byte must be in the range A0..BF; if the first byte is F0, the
second byte must be in the range 90..BF; and if the first byte is F4, the
second byte must be in the range 80..8F.
On the previous page, the doc says explicitly "Any UTF-8 byte sequence that
would otherwise map to code points D800..DFFF is ill-formed."
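To make the ED case concrete, here is a small sketch in C. The function names are made up for illustration and are not taken from PCRE or any other library. It decodes a three-byte sequence with no range checks at all, which shows where a second byte above 9F ends up:

```c
#include <assert.h>
#include <stdint.h>

/* Naively decode a three-byte UTF-8 sequence, with no well-formedness
   checks beyond the lead-byte precondition. Hypothetical helper. */
uint32_t decode3_unchecked(uint8_t b1, uint8_t b2, uint8_t b3)
{
    assert((b1 & 0xF0) == 0xE0);           /* lead byte must be E0..EF */
    return ((uint32_t)(b1 & 0x0F) << 12) | /* 4 payload bits */
           ((uint32_t)(b2 & 0x3F) << 6)  | /* 6 payload bits */
            (uint32_t)(b3 & 0x3F);         /* 6 payload bits */
}

/* True if cp lies in the surrogate range D800..DFFF. */
int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}
```

With this sketch, ED 9F BF decodes to U+D7FF, while ED A0 80 decodes to U+D800 and ED BF BF to U+DFFF. So a second byte in A0..BF after ED always lands in the surrogate range, which is why the table stops at 9F.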
For convenience, I've attached the full table below. Hopefully it will look
good in a browser!
Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF    80..BF
U+0800..U+0FFF       E0        A0..BF    80..BF
U+1000..U+CFFF       E1..EC    80..BF    80..BF
U+D000..U+D7FF       ED        80..9F    80..BF
U+E000..U+FFFF       EE..EF    80..BF    80..BF
U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF
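As a rough illustration of how the table could be turned into a check, here is a minimal sketch in C. It is not PCRE's actual validation code, and the function name utf8_is_well_formed is an assumption made up for this example. The idea is that only the second byte ever has a narrowed range, so the validator picks that range from the first byte and treats all later bytes as 80..BF:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if buf[0..len) is well-formed UTF-8 according to the table
   above, otherwise 0. Hypothetical helper, for illustration only. */
int utf8_is_well_formed(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t i = 0;
    while (i < len) {
        unsigned char b1 = buf[i];
        unsigned char lo = 0x80, hi = 0xBF; /* default 2nd-byte range */
        size_t trail;                       /* number of bytes after b1 */

        if (b1 <= 0x7F) {                   /* single-byte (ASCII) */
            i++;
            continue;
        } else if (b1 >= 0xC2 && b1 <= 0xDF) {
            trail = 1;
        } else if (b1 >= 0xE0 && b1 <= 0xEF) {
            trail = 2;
            if (b1 == 0xE0) lo = 0xA0;      /* rejects overlong forms */
            else if (b1 == 0xED) hi = 0x9F; /* rejects D800..DFFF */
        } else if (b1 >= 0xF0 && b1 <= 0xF4) {
            trail = 3;
            if (b1 == 0xF0) lo = 0x90;      /* rejects overlong forms */
            else if (b1 == 0xF4) hi = 0x8F; /* rejects > U+10FFFF */
        } else {
            return 0;   /* 80..C1 and F5..FF never start a sequence */
        }

        if (len - i <= trail)
            return 0;                       /* sequence is truncated */
        if (buf[i + 1] < lo || buf[i + 1] > hi)
            return 0;                       /* 2nd byte out of range */
        for (size_t k = 2; k <= trail; k++)
            if (buf[i + k] < 0x80 || buf[i + k] > 0xBF)
                return 0;                   /* bad continuation byte */
        i += trail + 1;
    }
    return 1;
}
```

Under this sketch, ED 9F BF is accepted and ED A0 80 is rejected, matching the ED row of the table. The four special cases are exactly the four first bytes called out earlier (E0, ED, F0, F4).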