------- You are receiving this mail because: -------
You are on the CC list for the bug.
http://bugs.exim.org/show_bug.cgi?id=530
Vincent de Phily <moltonel@???> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |moltonel@???
             Status|RESOLVED                    |REOPENED
         Resolution|INVALID                     |
--- Comment #5 from Vincent de Phily <moltonel@???> 2007-08-07 15:52:57 ---
Sorry for replying on SourceForge; I didn't know about the Exim bug tracker
myself.
Please look at RFC 3629, which obsoletes RFC 2279, the one you were using. It
seems that the definition of UTF-8 changed quite a bit between these two
versions. I find the RFC a bit hard to read for the purpose of checking UTF-8
in my head, but I still believe I am correct:
* The RFC states that encodings of code points between U+D800 and U+DFFF (such
as 0x0000DC94 from your rebuttal) are prohibited in UTF-8.
* The BNF in this RFC looks similar to what can be found in the Unicode docs
(which I used as reference to state that \xED\xB2\x94 and \xF4\x92\x94\x95 are
invalid UTF-8), and to the Perl regexp found for example at
http://www.w3.org/International/questions/qa-forms-utf-8.en.php.
* RFC 3629 says that it and [UNICODE] have been kept in sync up to its
publication date (November 2003), and even goes as far as stating "The
authoritative definition of UTF-8 is in [UNICODE]."
To sum up, I think libPCRE is currently using an outdated definition of UTF-8.
Some byte sequences that were valid under RFC 2279 are no longer valid under
RFC 3629.