[pcre-dev] [Bug 530] pcre_exec doesn't return PCRE_ERROR_BAD…

Author: Craig Silverstein
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 530] pcre_exec doesn't return PCRE_ERROR_BADUTF8 for invalid UTF8

http://bugs.exim.org/show_bug.cgi?id=530




--- Comment #6 from Craig Silverstein <csilvers@???> 2007-08-07 17:40:06 ---
While I'm no utf-8 wizard, the standard does seem to support Vincent's
position. The controlling doc for the utf-8 definition appears to be
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G7404

On page 25 of the pdf file (listed as page 78 in the chapter), there's a table
of well-formed utf-8, and the table says that if the first byte is ED, the
second byte has to be in the range 80..9F, not 80..BF like every other
byte-sequence. There are three other, similar exceptions: if the first byte is
E0, the second byte must be in the range A0..BF; if the first byte is F0, the
second byte must be in the range 90..BF; and if the first byte is F4, the
second byte must be in the range 80..8F.

On the previous page, the doc says explicitly "Any UTF-8 byte sequence that
would otherwise map to code points D800..DFFF is ill-formed."
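
To see why the ED restriction and the D800..DFFF rule are the same thing, it
helps to do the bit arithmetic. The snippet below is only an illustration, not
PCRE code: decode3_unchecked is a hypothetical name, and it deliberately does
no validation, which is exactly what lets a surrogate through.

```c
/* Naive decode of a three-byte sequence with NO validity checks.
   decode3_unchecked is a hypothetical helper for illustration only. */
unsigned int decode3_unchecked(unsigned char b0, unsigned char b1,
                               unsigned char b2)
{
    /* 1110xxxx 10yyyyyy 10zzzzzz  ->  xxxxyyyyyyzzzzzz */
    return ((unsigned int)(b0 & 0x0F) << 12)
         | ((unsigned int)(b1 & 0x3F) << 6)
         |  (unsigned int)(b2 & 0x3F);
}
```

With a first byte of ED, a second byte of 9F tops out at U+D7FF (ED 9F BF),
while a second byte of A0 already gives U+D800 (ED A0 80), and ED BF BF gives
U+DFFF. So capping the second byte at 9F is what excludes the surrogate range.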

For convenience, I've attached the full table below. Hopefully it will look
good in a browser!

Code Points        1st Byte 2nd Byte 3rd Byte 4th Byte
U+0000..U+007F     00..7F
U+0080..U+07FF     C2..DF   80..BF
U+0800..U+0FFF     E0       A0..BF   80..BF
U+1000..U+CFFF     E1..EC   80..BF   80..BF
U+D000..U+D7FF     ED       80..9F   80..BF
U+E000..U+FFFF     EE..EF   80..BF   80..BF
U+10000..U+3FFFF   F0       90..BF   80..BF   80..BF
U+40000..U+FFFFF   F1..F3   80..BF   80..BF   80..BF
U+100000..U+10FFFF F4       80..8F   80..BF   80..BF
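
In case it helps to see the table as code, here is a rough sketch of a checker
that follows it row by row. This is not how pcre_exec does (or should do) its
check; utf8_is_well_formed is a name I made up, and the only assumption is
that the input is a byte buffer with an explicit length.

```c
#include <stddef.h>

/* Returns 1 if buf[0..len) is well-formed UTF-8 according to the table
   above, 0 otherwise. Hypothetical helper, not part of the PCRE API. */
int utf8_is_well_formed(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = buf[i];
        unsigned char lo = 0x80, hi = 0xBF; /* default 2nd-byte range */
        size_t trailing, k;

        if (b <= 0x7F) {                    /* one-byte sequence */
            i++;
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) {
            trailing = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            trailing = 2;
            if (b == 0xE0) lo = 0xA0;       /* rejects overlong forms */
            else if (b == 0xED) hi = 0x9F;  /* rejects D800..DFFF */
        } else if (b >= 0xF0 && b <= 0xF4) {
            trailing = 3;
            if (b == 0xF0) lo = 0x90;       /* rejects overlong forms */
            else if (b == 0xF4) hi = 0x8F;  /* rejects > U+10FFFF */
        } else {
            return 0;                       /* 80..C1 and F5..FF */
        }

        if (len - i <= trailing)            /* sequence is truncated */
            return 0;

        /* The second byte uses the (possibly narrowed) range. */
        if (buf[i + 1] < lo || buf[i + 1] > hi)
            return 0;

        /* Any remaining bytes must be plain continuation bytes. */
        for (k = 2; k <= trailing; k++)
            if (buf[i + k] < 0x80 || buf[i + k] > 0xBF)
                return 0;

        i += trailing + 1;
    }
    return 1;
}
```

With this, the three bytes ED A0 80 are rejected, while ED 9F BF is accepted.
Only the second byte ever needs a special range; the third and fourth bytes
are always 80..BF, which matches the table.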


