Re: [exim] exim.org filter spec error -- not found

Author: Philip Hazel
Date:  
To: Bill Hacker
CC: exim-users
Subject: Re: [exim] exim.org filter spec error -- not found
On Wed, 14 Dec 2005, Bill Hacker wrote:

> Simply set 'UTF-8' in the meta-data of the webpage.
>
> ISO-8859-1 is a (mostly) proper subset, but not the reverse.


That isn't strictly true. Confusion arises because UTF-8 is not,
strictly, a character set. It is a way of encoding a sequence of
numbers, whose values need up to 21 bits to represent in binary, as a
string of 8-bit bytes of variable length, where the first 128 numbers
are represented by single bytes.
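
To make the "numbers into bytes" step concrete, here is a minimal sketch
in Python. The function name utf8_encode is my own, made up for
illustration; real code would just call str.encode("utf-8"). It assumes
the modern limit of hex 10FFFF and rejects the surrogate range.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (hypothetical helper, illustration only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        # 0xxxxxxx: the first 128 numbers are single bytes
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

Note that nothing in the function knows what character a number stands
for; it only turns numbers into bytes.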

Unicode is a character set that defines character code points, with
values from 0 to hex 10FFFF (at most 21 bits), though most characters in
common use are within the 16-bit limit. Unicode is often represented
using the UTF-8 value encoding, but not always. Some applications use
16-bit values (UTF-16) instead. However, in the context of many
applications, including, it seems, the web, the name "UTF-8" has become
synonymous with "Unicode, encoded as UTF-8".

ISO-8859-1 code values are a subset of Unicode code values. However,
ISO-8859-1 code values are always represented as single bytes. This
means that values 0-127 are indeed identical to the UTF-8 values 0-127.
However, the remaining ISO-8859-1 code points (128-255), though they
encode the same characters as Unicode, are not represented in the same
way. In ISO-8859-1 these values are single bytes; in Unicode/UTF-8 they
require two bytes. Take, for example, the character whose Unicode and
ISO-8859-1 code point is hex F7 (U+00F7, the division sign). In
ISO-8859-1 this would be the single byte with hex value F7; in UTF-8
this value is coded as the two bytes C3, B7.
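
The example is easy to check with a few lines of Python, using only the
standard codecs. The variable names are just for illustration.

```python
ch = "\u00f7"  # the division sign, code point hex F7

latin1 = ch.encode("iso-8859-1")  # one byte: F7
utf8 = ch.encode("utf-8")         # two bytes: C3, B7

print(latin1.hex(), utf8.hex())   # prints: f7 c3b7

# For values 0-127 the two encodings give identical bytes.
ascii_text = "exim"
same = ascii_text.encode("iso-8859-1") == ascii_text.encode("utf-8")
print(same)                       # prints: True
```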

Therefore, if you have a file that contains ISO-8859-1 and it contains
characters in the range 128-255, you cannot just pretend that it is
UTF-8 Unicode. In fact, it will most probably be invalid as a UTF-8 file
because the bytes with the top bit set won't, in general, form valid
UTF-8 sequences. Some of them, though (e.g. the sequence C3, B7) will be
valid as UTF-8. So you will get a mess.
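
Both outcomes can be seen in a short Python sketch. The sample strings
are my own, chosen only to trigger each case.

```python
# Case 1: an ISO-8859-1 file containing the division sign (byte F7).
latin1_bytes = "10 \u00f7 2".encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # F7 followed by a space is not a valid UTF-8 sequence
    decoded_ok = False

# Case 2: ISO-8859-1 text that happens to contain the bytes C3, B7
# (capital A with tilde, then the division sign).
accidental = "\u00c3\u00b7".encode("iso-8859-1")
misread = accidental.decode("utf-8")  # silently becomes a single U+00F7

print(decoded_ok, len(misread))
```

The first case fails loudly, which is the better outcome. The second
decodes without complaint but turns two characters into one different
character.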

-- 
Philip Hazel            University of Cambridge Computing Service,
ph10@???      Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book:    http://www.uit.co.uk/exim-book