Re: [pcre-dev] [Bug 530] pcre_exec doesn't return PCRE_ERRO…

Author: Philip Hazel
Date:  
To: 530
CC: pcre-dev
Subject: Re: [pcre-dev] [Bug 530] pcre_exec doesn't return PCRE_ERROR_BADUTF8 for invalid UTF8

On Tue, 7 Aug 2007, Vincent de Phily wrote:

> Sorry for replying on sourceforge, I didn't know about the exim bugtracker
> myself.


It's relatively recent that PCRE was added to it; for my part, I didn't
know about the sourceforge buglist thing. The whole sourceforge PCRE
project is nothing to do with me! It has been a source of confusion on
several occasions, unfortunately.

> Please look at RFC3629, which obsoletes RFC2279 which you were using. It seems
> that the definition of UTF8 has changed quite a bit between these two versions.


Yes, indeed. Thanks for pointing that out. It seems that I have been
misunderstanding the term "UTF-8" as it is now used.

MY MEANING: "A compact way of encoding a string of 31-bit values in a
byte stream." This is what RFC 2279 defines: it allows from 1 to 6
bytes, for values in the range 0 to 0x7FFFFFFF.

THE WORLD'S MEANING: "A compact way of encoding a string of valid
Unicode code points in a byte stream." This is what RFC 3629 defines: it
allows only from 1 to 4 bytes, for values in the range 0 to 0x0010FFFF,
with the internal sub-range 0xD800 to 0xDFFF prohibited.
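To make the difference between the two definitions concrete, here is a minimal sketch in C of how many bytes each scheme would need for a given value. The function names are hypothetical and this is not code taken from PCRE; it only restates the ranges given above.

```c
#include <stdint.h>

/* Bytes needed to encode v in the RFC 2279 style (1 to 6),
   or 0 if v does not fit in 31 bits. */
int utf8_len_rfc2279(uint32_t v)
{
    if (v < 0x80u)        return 1;
    if (v < 0x800u)       return 2;
    if (v < 0x10000u)     return 3;
    if (v < 0x200000u)    return 4;
    if (v < 0x4000000u)   return 5;
    if (v <= 0x7FFFFFFFu) return 6;
    return 0;
}

/* Bytes needed in the RFC 3629 style (1 to 4), or 0 if v is not
   an encodable Unicode code point. */
int utf8_len_rfc3629(uint32_t v)
{
    if (v > 0x10FFFFu) return 0;                      /* beyond Unicode */
    if (v >= 0xD800u && v <= 0xDFFFu) return 0;       /* prohibited sub-range */
    return utf8_len_rfc2279(v);   /* same byte layout for what remains */
}
```

For example, 0xD800 needs 3 bytes in the 2279 style but is rejected in the 3629 style, and 0x7FFFFFFF needs 6 bytes in the 2279 style but is likewise rejected.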

> To sum up, I think libPCRE is currently using an outdated definition
> of "UTF8". Byte sequences that were valid aren't anymore.


I checked into the history: the beginnings of UTF-8 support were added
to PCRE in August, 2000. The checking for basic validity was added in
August 2003, and tightened up to exclude "overlong sequences" in
December 2003. I see that RFC 3629 is dated November 2003, so I can
argue that I was probably working from the current standard at the time.

Perhaps we need another name for the 2279-style encoding. I have always
felt that the encoding of wide data values should be a separate issue
from what they represent, but it seems clear from 3629 that the term
"UTF-8" now has the meaning "a byte encoding for Unicode".

SO: what should be done? There are two options: change the code or
change the documentation.

1. Before I checked the existing code this morning, I was considering
changing the documentation to make it clear that the only reason for the
check was to allow the main code of PCRE to assume it had well-formed
sequences, and thereby not have to check each time. This avoids crashes
for malformed sequences such as a byte with the top bit set at the end
of the string. I was prepared to argue that this is just a "syntax
check", with no semantics intended (such as checking for out-of-range
or prohibited values).

2. However, when I looked at the code, I found that there is more - the
"overlong sequences" check. This goes beyond the simplest "syntax check"
because it insists on a canonical representation of the character. PCRE
has had this check for over 3 years. So some kind of boundary into
"semantic checking" has already been crossed.

3. Looking at the code, it doesn't seem as if it would add very much
more to restrict strings to 3629-style UTF-8. (I am always conscious of
performance issues with PCRE.)
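The three levels of checking discussed in points 1 to 3 can be sketched as follows. This is an illustration only, written for this discussion: it is not PCRE's actual validity check, and the names check_utf8, strict_3629 and the UTF8_* codes are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { UTF8_OK = 0, UTF8_ERR_SYNTAX, UTF8_ERR_OVERLONG, UTF8_ERR_RANGE };

/* Check a whole buffer. If strict_3629 is nonzero, also reject values
   above 0x10FFFF and the sub-range 0xD800 to 0xDFFF. */
int check_utf8(const unsigned char *s, size_t len, int strict_3629)
{
    /* Smallest value that genuinely needs n bytes, indexed by n. */
    static const uint32_t min_value[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i++];
        int n;          /* total bytes in this sequence */
        int k;
        uint32_t v;     /* decoded value */

        if (c < 0x80) continue;                  /* plain ASCII */
        if (c < 0xC0) return UTF8_ERR_SYNTAX;    /* stray continuation byte */
        else if (c < 0xE0) { n = 2; v = c & 0x1F; }
        else if (c < 0xF0) { n = 3; v = c & 0x0F; }
        else if (c < 0xF8) { n = 4; v = c & 0x07; }
        else if (c < 0xFC) { n = 5; v = c & 0x03; }
        else if (c < 0xFE) { n = 6; v = c & 0x01; }
        else return UTF8_ERR_SYNTAX;             /* 0xFE and 0xFF are never valid */

        /* 1. Syntax check: n-1 bytes of the form 10xxxxxx must follow.
           This catches a top-bit byte at the end of the string. */
        if (len - i < (size_t)(n - 1)) return UTF8_ERR_SYNTAX;
        for (k = 1; k < n; k++) {
            unsigned char d = s[i++];
            if ((d & 0xC0) != 0x80) return UTF8_ERR_SYNTAX;
            v = (v << 6) | (uint32_t)(d & 0x3F);
        }

        /* 2. Overlong check: insist on the shortest possible encoding. */
        if (v < min_value[n]) return UTF8_ERR_OVERLONG;

        /* 3. Extra 3629-style checks on the decoded value. */
        if (strict_3629 && (v > 0x10FFFFu || (v >= 0xD800u && v <= 0xDFFFu)))
            return UTF8_ERR_RANGE;
    }
    return UTF8_OK;
}
```

In this sketch the 3629-style restriction costs one extra comparison per multi-byte character. With strict_3629 set to 0, the bytes ED A0 80 (value 0xD800) pass; with it set to 1 they are rejected with UTF8_ERR_RANGE.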

So... I suppose I have to bend with this and add the extra checks. If
anybody does want to use PCRE to match strings of 31-bit values
represented in 2279 fashion, they can do their own checking and turn off
PCRE's check, after all.

Philip

--
Philip Hazel, University of Cambridge Computing Service.