Re: [Exim] Interpreting Subject: lines - opinions, please

Author: Philip Hazel
Date:  
To: Dr Andrew C Aitchison
CC: exim-users
Subject: Re: [Exim] Interpreting Subject: lines - opinions, please
On Tue, 8 Jul 2003, Dr Andrew C Aitchison wrote:

> I doubt that many users will wish to care about the encoding of the
> message text, so we need to convert both the message and the comparison
> string into the same encoding.


Quite.

> *iconv appears to be the standard way of converting between any
> two encodings, and ships with Solaris 7 and RedHat 6.2 at least.


I certainly do not propose to do anything if the host doesn't support
iconv. (The supplied patch has a few built-ins, but I don't think that's
right.)

> How about the following:
> $rh_ does a byte wise comparison after converting both strings from
> the MIME encoding,


$rh_ is "raw header", so must remain, I think, the actual sequence of
bytes in the header, unmodified in any way.

If we want to have something that specifies "decode from printing chars
to binary, but ignore the character set", it will have to be some new
thing, for example $bh_.
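
As a rough illustration of what "decode to binary, ignore the character
set" would mean, here is a minimal sketch. It is Python using the standard
email.header module, not Exim code, and the helper name
decode_header_to_bytes is hypothetical. It only undoes the Q/B transfer
encoding of RFC 2047 encoded-words and throws the charset label away.

```python
from email.header import decode_header


def decode_header_to_bytes(raw_value: str) -> bytes:
    """Undo the Q/B encoding of RFC 2047 encoded-words, ignoring charsets.

    Hypothetical helper, for illustration only.
    """
    pieces = []
    for chunk, _charset in decode_header(raw_value):
        # decode_header() yields str when the value has no encoded-words.
        if isinstance(chunk, str):
            chunk = chunk.encode("latin-1", errors="replace")
        pieces.append(chunk)
    return b"".join(pieces)


# The same visible text, sent in two different character sets.
latin1 = decode_header_to_bytes("=?iso-8859-1?Q?internet_caf=E9?=")
utf8 = decode_header_to_bytes("=?utf-8?Q?internet_caf=C3=A9?=")
```

The two results differ in their final bytes, so a bytewise test on such a
value only matches when the filter string happens to be stored in the same
character set as the message.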

> whilst $h_ assumes that the comparison string is in a particular encoding
> (given in a server default unless overridden in the filter file in use),


Yes, that's what the supplied patch does.

> This way I can use $h_ to test for "internet café" without caring
> whether your message uses iso-8859-1, iso-8859-15 or UTF-8,
> provided only that my sysadmin ensures that my editor and the exim config
> agree how the string is stored in my filter file.


Yes, that is one approach. However, it isn't very general. Consider
various pre-built distributions of Exim that get installed all over the
world. A single character set is probably not good enough, but people
aren't going to understand the problem. Even on one host, there may be
users logging in from different parts of the world, and setting their
text editors to use different character sets.
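
To make the failure mode concrete, here is a small sketch in Python (not
Exim code; the function filter_matches and its parameters are made up for
illustration). It assumes the header has already been converted to the
server's configured character set, and that the filter string sits in the
file in whatever character set the user's editor wrote.

```python
def filter_matches(header_text: str, pattern: str,
                   server_charset: str, editor_charset: str) -> bool:
    """Bytewise substring test, as a filter would see it (illustrative)."""
    # The header bytes after conversion to the server's charset.
    header_bytes = header_text.encode(server_charset)
    # The pattern bytes exactly as the editor stored them in the filter file.
    pattern_bytes = pattern.encode(editor_charset)
    return pattern_bytes in header_bytes


subject = "Visit the internet café"
same = filter_matches(subject, "internet café", "iso-8859-1", "iso-8859-1")
mixed = filter_matches(subject, "internet café", "iso-8859-1", "utf-8")
```

Here same is true and mixed is false: the filter silently stops matching
as soon as the editor and the server disagree, and nothing tells the user
why.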

I fear a support effort sink.

I think we've narrowed it down to the following possibilities:

A: Just decode to "binary"; ignore character set information. Is this
actually going to be helpful enough to be worth doing?

B: Fully convert to some default character set for $h_ (possibly
supporting $bh_ for "binary" only) on systems that have iconv().

There could be a default character set (probably iso-8859-1 would
be more useful than Unicode-in-UTF-8, but that's arguable, I
suppose). Given that the conversion is done, it wouldn't be much
extra to allow the sysadmin to override, and indeed to allow the user
in a filter to override. In effect, to say "the strings in this
filter are in charset X". We would also need something for the config
file to say the same thing.
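
A minimal sketch of option B, again in Python rather than C, and using
Python's own codecs in place of iconv(). The name header_in_charset, the
DEFAULT_CHARSET constant, and the fallback for unknown character set labels
are all assumptions made for the example, not a description of the supplied
patch.

```python
from email.header import decode_header

DEFAULT_CHARSET = "iso-8859-1"  # assumed server-wide default


def header_in_charset(raw_value: str, target: str = DEFAULT_CHARSET) -> bytes:
    """Decode RFC 2047 encoded-words and convert each part to `target`.

    Hypothetical helper; `target` plays the role of a sysadmin or
    per-filter override of the default character set.
    """
    out = []
    for chunk, charset in decode_header(raw_value):
        if isinstance(chunk, bytes):
            try:
                text = chunk.decode(charset or "ascii", errors="replace")
            except LookupError:
                # Unknown charset label: fall back to ASCII with replacement.
                text = chunk.decode("ascii", errors="replace")
        else:
            text = chunk
        # Characters the target cannot represent become "?".
        out.append(text.encode(target, errors="replace"))
    return b"".join(out)


headers = [
    "=?iso-8859-1?Q?internet_caf=E9?=",
    "=?iso-8859-15?Q?internet_caf=E9?=",
    "=?utf-8?Q?internet_caf=C3=A9?=",
]
converted = [header_in_charset(h) for h in headers]
```

All three headers come out as the same iso-8859-1 bytes, so one filter
string matches all of them. Passing target="utf-8" would model a filter
that declares its strings to be in UTF-8 instead.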

B is more work to implement, and potentially a lot more work to support,
because the whole character set area is a can of worms which few people
actually understand. But of course, in a lot of cases (certainly in
Western Europe) it might "just work" well enough to be useful.

--
Philip Hazel            University of Cambridge Computing Service,
ph10@???      Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book:    http://www.uit.co.uk/exim-book