Re: [Exim] Interpreting Subject: lines

Author: Philip Hazel
Date:
To: exim-users
Subject: Re: [Exim] Interpreting Subject: lines - opinions, please

OK, I've done some further thinking based on the feedback I've had.

1. It still seems to me that the right thing to do is to change $h_ so
that it has the best "automatic" chance of doing the right thing,
without the end user having to specify anything special.

2. The $rh_ facility already exists for those who want to look at the
raw bytes in a header line. If you are inspecting headers for code
names, now is a good time to change to using $rh_.

3. Just decoding MIME "words" from printed characters into binary isn't
really good enough, because it loses information about the character
set. If I just do this, it will work some of the time for some
people, but the issue will still cause problems.

4. Going all the way and taking account of character sets is messy.
Whatever is done is never going to be perfect. (This problem won't go
away till everybody is using Unicode everywhere. I think I'll be long
gone by then.) However, a semi-reasonable attempt can be made.

5. I therefore propose:

   $rh_    will remain unchanged, as the "raw" header data
   $bh_    will be a new feature, containing the decoded-into-bytes data,
           but without any character set conversion
   $h_     will be changed so that the character set conversion is done,
           on those OS that support the iconv() function.
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I have been able to check a few OS. As far as I can tell Solaris, Linux,
Irix, OSF1, FreeBSD, and HP-UX do support iconv(). SunOS4 and BSDI
appear not to. If you have access to any other operating system, please
check and let me know - the easiest check is "man iconv", but the
existence of /usr/include/iconv.h is another test. On those OS that are
not known to support iconv(), $h_ will be the same as $bh_.
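
For anyone who wants a stronger test than "man iconv", here is a minimal
sketch of the kind of conversion $h_ would rely on. It uses only the
standard iconv_open(), iconv() and iconv_close() calls. The function name
convert_charset and its buffer handling are my own illustration, not
anything in Exim. If it compiles and links on your OS, iconv() is
available there (some systems need -liconv when linking).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert a NUL-terminated string from one character set to another.
   Returns 0 on success, -1 if the conversion is unsupported or fails.
   Hypothetical helper for illustration only. */
int convert_charset(const char *from, const char *to,
                    const char *in, char *out, size_t outsize)
{
    iconv_t cd;
    char *inp = (char *)in;      /* iconv() takes a non-const pointer */
    char *outp = out;
    size_t inleft, outleft, rc;

    assert(from != NULL && to != NULL && in != NULL && out != NULL);
    if (outsize == 0) return -1;
    inleft = strlen(in);
    outleft = outsize - 1;       /* keep room for the terminator */

    cd = iconv_open(to, from);   /* note the order: target first */
    if (cd == (iconv_t)-1) return -1;

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush state */
    iconv_close(cd);
    if (rc == (size_t)-1) return -1;

    *outp = '\0';
    return 0;
}
```

Calling convert_charset("ISO-8859-1", "UTF-8", text, buf, sizeof buf)
fills buf with the converted text. A return of -1 from an unknown
character set name is the case that would have to turn into a
configuration error.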

The messiness is in specifying the character set to which
MIME-encoded "words" in header lines are converted. I propose to
implement the following:

   (i)    A build-time setting HEADER_DECODE_TO that defaults to
          "ISO-8859-1" (if the OS has iconv() support).

   (ii)   A runtime option called header_decode_to that can override the
          default.

   (iii)  A new filter command called header_decode_to that can set a
          value for use while interpreting a filter. This allows
          different users to use different encodings for the text in
          their filters.

For those operating systems that are not known to support iconv(),
attempts to set header_decode_to (either in the config or in a filter)
will provoke errors.
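
To make the intended precedence concrete, here is a rough sketch of how
the three settings might combine. It assumes that a filter's setting wins
over the runtime option, which in turn wins over the build-time default.
That ordering is my reading of the list above, and the function and
parameter names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define HEADER_DECODE_TO "ISO-8859-1"   /* build-time default, as in (i) */

/* Work out which character set to decode into.
   runtime_option and filter_setting are NULL when unset.
   A NULL result means "do no conversion", so $h_ behaves like $bh_.
   *error is set if a setting was attempted without iconv() support. */
const char *effective_decode_charset(int have_iconv,
                                     const char *runtime_option,
                                     const char *filter_setting,
                                     const char **error)
{
    assert(error != NULL);
    *error = NULL;

    if (!have_iconv) {
        if (runtime_option != NULL || filter_setting != NULL)
            *error = "header_decode_to is not supported without iconv()";
        return NULL;
    }

    if (filter_setting != NULL) return filter_setting;   /* (iii) */
    if (runtime_option != NULL) return runtime_option;   /* (ii)  */
    return HEADER_DECODE_TO;                             /* (i)   */
}
```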

The actual decoding and conversion will be implemented in a separate
function that can also be used by patches, and from local_scan() etc.
The currently proposed API is

uschar *rfc2047_decode(uschar *source, uschar *target, uschar **error)

where source is the source string and target is the name of the required
character set. If target is NULL, no conversion is done. The function
returns the new string. If something goes wrong, NULL is returned, with a
message in error.
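
As an illustration of the calling convention only, here is a cut-down
sketch with the same shape as the proposed API. It is not the real
implementation: the name rfc2047_decode_sketch is made up, it handles
only "Q"-encoded words, it copies anything else through unchanged, it
does not drop white space between adjacent encoded words, and it refuses
a non-NULL target because no character set conversion is included. The
uschar typedef is assumed to be unsigned char.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned char uschar;    /* assumed definition */

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = toupper(c);
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode =?charset?Q?text?= words into raw bytes (the $bh_ case).
   Returns a malloc()ed string, or NULL with a message in *error. */
uschar *rfc2047_decode_sketch(uschar *source, uschar *target, uschar **error)
{
    const char *s = (const char *)source;
    uschar *result, *out;

    assert(source != NULL && error != NULL);
    *error = NULL;
    if (target != NULL) {
        *error = (uschar *)"character set conversion is not in this sketch";
        return NULL;
    }

    /* Decoded text is never longer than its encoded form. */
    result = out = (uschar *)malloc(strlen(s) + 1);
    if (result == NULL) {
        *error = (uschar *)"out of memory";
        return NULL;
    }

    while (*s != '\0') {
        const char *q, *end, *p;
        if (s[0] == '=' && s[1] == '?' &&
            (q = strchr(s + 2, '?')) != NULL &&      /* end of charset */
            (q[1] == 'Q' || q[1] == 'q') && q[2] == '?' &&
            (end = strstr(q + 3, "?=")) != NULL) {   /* end of word */
            for (p = q + 3; p < end; p++) {
                if (*p == '_') {
                    *out++ = ' ';
                } else if (*p == '=' && p + 2 < end &&
                           hexval((uschar)p[1]) >= 0 &&
                           hexval((uschar)p[2]) >= 0) {
                    *out++ = (uschar)(hexval((uschar)p[1]) * 16 +
                                      hexval((uschar)p[2]));
                    p += 2;
                } else {
                    *out++ = (uschar)*p;
                }
            }
            s = end + 2;
        } else {
            *out++ = (uschar)*s++;   /* ordinary text, or unhandled word */
        }
    }
    *out = 0;
    return result;
}
```

A caller passes NULL as target to get the bytes without conversion, tests
the result for NULL, and reads the message from error on failure. The
caller frees the result in this sketch; how the real function manages
memory is not settled here.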

--
Philip Hazel            University of Cambridge Computing Service,
ph10@???      Cambridge, England. Phone: +44 1223 334714.
