[pcre-dev] [Bug 530] pcre_exec doesn't return PCRE_ERROR_BAD…

Author: Vincent de Phily
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 530] pcre_exec doesn't return PCRE_ERROR_BADUTF8 for invalid UTF8
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=530

Vincent de Phily <moltonel@???> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |moltonel@???
             Status|RESOLVED                    |REOPENED
         Resolution|INVALID                     |





--- Comment #5 from Vincent de Phily <moltonel@???> 2007-08-07 15:52:57 ---
Sorry for replying on sourceforge, I didn't know about the exim bugtracker
myself.


Please look at RFC3629, which obsoletes RFC2279 which you were using. It seems
that the definition of UTF8 has changed quite a bit between these two versions.

I find the RFC a bit hard to read for the purpose of UTF8-checking in my head,
but I still believe I am correct:
* The RFC states that encoding the code points between U+D800 and U+DFFF (which
include 0x0000DC94 from your rebuttal) is prohibited in UTF8.
* The BNF in this RFC looks similar to what can be found in the Unicode docs
(which I used as reference to state that \xED\xB2\x94 and \xF4\x92\x94\x95 are
invalid UTF8), or to the Perl regexp found for example at
http://www.w3.org/International/questions/qa-forms-utf-8.en.php.
* RFC3629 says that it and UNICODE have been kept in sync up to its publication
date (November 2003), and even goes as far as stating "The authoritative
definition of UTF-8 is in [UNICODE]."
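
To make the points above concrete, here is a minimal sketch of a checker that
follows the UTF8-octets ABNF from RFC3629 section 4. It is not PCRE's code; the
function name utf8_valid_rfc3629 is made up for illustration. It only shows
that the two sequences from this report are rejected by the RFC3629 grammar:
\xED\xB2\x94 because the second byte after 0xED must be in 0x80-0x9F (this
excludes the surrogates U+D800..U+DFFF), and \xF4\x92\x94\x95 because the
second byte after 0xF4 must be in 0x80-0x8F (this excludes anything above
U+10FFFF).

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: returns 1 if s[0..len) is
 * well-formed UTF-8 according to the RFC 3629 ABNF, 0 otherwise. */
int utf8_valid_rfc3629(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range of 2nd byte */
        size_t need;                        /* continuation bytes expected */
        size_t k;

        if (c <= 0x7F) {                    /* UTF8-1: plain ASCII */
            i++;
            continue;
        } else if (c >= 0xC2 && c <= 0xDF) {/* UTF8-2 (0xC0, 0xC1 overlong) */
            need = 1;
        } else if (c >= 0xE0 && c <= 0xEF) {/* UTF8-3 */
            need = 2;
            if (c == 0xE0) lo = 0xA0;       /* reject overlong forms */
            else if (c == 0xED) hi = 0x9F;  /* reject U+D800..U+DFFF */
        } else if (c >= 0xF0 && c <= 0xF4) {/* UTF8-4 */
            need = 3;
            if (c == 0xF0) lo = 0x90;       /* reject overlong forms */
            else if (c == 0xF4) hi = 0x8F;  /* reject > U+10FFFF */
        } else {
            return 0;                       /* 0x80-0xC1 or 0xF5-0xFF */
        }

        if (len - i <= need)                /* sequence is truncated */
            return 0;
        if (s[i + 1] < lo || s[i + 1] > hi) /* restricted 2nd byte */
            return 0;
        for (k = 2; k <= need; k++)         /* remaining UTF8-tail bytes */
            if (s[i + k] < 0x80 || s[i + k] > 0xBF)
                return 0;

        i += need + 1;
    }
    return 1;
}
```

Called as utf8_valid_rfc3629((const unsigned char *)"\xED\xB2\x94", 3), this
returns 0, and likewise for the four-byte sequence \xF4\x92\x94\x95, while
boundary cases such as \xED\x9F\xBF (U+D7FF) and \xF4\x8F\xBF\xBF (U+10FFFF)
are accepted.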


To sum up, I think libPCRE is currently using an outdated definition of "UTF8".
Some byte sequences that were valid under RFC2279 are no longer valid under
RFC3629.


--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email