Re: [Exim] CRLF input through pipe causes mangled headers

Author: Philip Hazel
Date:
To: Barry Pederson
CC: exim-users
Subject: Re: [Exim] CRLF input through pipe causes mangled headers

On Tue, 22 Jul 2003, Barry Pederson wrote:

> No no no, in this case with Cyrus, the bare <CR>s (not paired with an LF) are
> not in the lines of the email message itself, *EXIM* puts them there, that
> can't be the right thing. Exim takes: "This is a line.<CRLF>", and
> transmits it as: "This is a line.<CRCRLF>".

That is only true when the source of the line is a non-SMTP input. Exim
currently thinks such lines should be terminated according to the Unix
conventions, since it is running on a Unix system.
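
As an illustration only (this is not Exim's code, and the function names
are hypothetical), here is a minimal Python sketch of how the doubled CR
arises when pipe input is split on LF alone and each line is then
terminated with CRLF for SMTP:

```python
def read_local_lines(data: bytes) -> list[bytes]:
    """Split non-SMTP input on LF only; a CR is treated as ordinary data."""
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # discard the empty piece after the final LF
    return lines


def write_smtp(lines: list[bytes]) -> bytes:
    """Terminate every stored line with CRLF for the wire."""
    return b"".join(line + b"\r\n" for line in lines)


piped = b"This is a line.\r\n"          # what the injecting program sent
wire = write_smtp(read_local_lines(piped))
print(wire)                             # the CR survives as data, then CRLF is added
```

The CR that arrived with the line is kept as part of the line's content,
so the CRLF added on output follows it.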

Note: I have already said that I propose to do something about this (see
below). No need to keep on arguing!

> >> In addition, the appearance of "bare" "CR" or "LF" characters in text
> >> (i.e., either without the other) has a long history of causing
> >> problems in mail implementations and applications that use the mail
> >> system as a tool. SMTP client implementations MUST NOT transmit
> >> these characters except when they are intended as line terminators
> >> and then MUST, as indicated above, transmit them only as a <CRLF>
> >> sequence.

That's all very well, but this begs the same question: what should Exim
do if a bare CR appears in an incoming message? This is the same
question as what should Exim do if a top-bit-set character appears in an
incoming message? At present (and I stress *at present*) Exim by default
does the same thing in both cases - it just transmits the character.

There was once an MTA that bounced messages that contained top-bit-set
characters. It was a real PITA for messages originating from countries
where ISO-8859-1 was the common character set, because accented
characters often appeared in messages. My take on this is that you are
more likely to achieve what the user wants by just transmitting such
characters.

Note that in some environments it is easy to get a top-bit-set character
into text by mistake - just brush against the wrong sequence of keys.

> I'm not going to argue whether the standard is stupid or not, but it *is* the
> standard, and if you start ignoring the parts you don't like, that's where
> the weird incompatibilities start popping up.

I'm afraid people have always ignored parts of RFCs they don't like.
This is the way the Internet works. In effect, the RFCs document
practice that should guarantee interworking. This doesn't always follow,
as I have learned in Exim. A number of restrictions have had to be
relaxed over the years because "other MTAs work that way". This means
there has to be a judgement call on each case.

I won't post anything more on this thread, but here is how I now stand:

ORIGINALLY

When I wrote Exim, I took the position that lines in files and pipes
inside Unix were LF-terminated, and lines on the wire in an SMTP
transaction were CRLF terminated. Translations were done for SMTP
input/output.

That didn't last long. :-(

COMPROMISES

  (i)  There were MTAs that sent bare LFs over the wire, "and they work
       with other MTAs".

  (ii) There were programs that injected local messages with CRLF, "and
       they work with other MTAs".

 (iii) There were cases where people wanted delivery using CRLF to a
       file or a pipe.

For (ii) I invented -dropcr and later drop_cr. At first, it dropped
*all* CRs, later just those that precede LF. For (iii) I invented
use_crlf.
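
The two generations of CR dropping described above differ only in which
CRs they remove. A rough Python sketch of the difference (the function
names are hypothetical and this is not how Exim implements the option):

```python
def drop_all_cr(data: bytes) -> bytes:
    """Earlier behaviour: remove every CR, wherever it appears."""
    return data.replace(b"\r", b"")


def drop_cr_before_lf(data: bytes) -> bytes:
    """Later behaviour: remove a CR only when it immediately precedes LF."""
    return data.replace(b"\r\n", b"\n")


sample = b"one\r\ntwo\rstill two\r\n"
print(drop_all_cr(sample))        # the bare CR inside the second line is lost
print(drop_cr_before_lf(sample))  # the bare CR inside the second line is kept
```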

RETHINK

It seems clear that my previous approach of treating bare CRs like any
other "illegal" character (such as top-bit-set characters) is probably
not the best one. Therefore, I am going to change Exim so that:

(a) The character sequences LF, CRLF, and CR-without-LF are all treated
    as line endings when Exim reads a message.

(b) Each line ending will be converted to LF on input, so that
    internally, Exim continues to use LF terminators.

(c) However, a bare CR in a message's headers will be treated specially,
    because such a character probably does not indicate the end of a
    header line. It will be converted to LF followed by one space.
    (I might even try to be cleverer and do this only if the following
    text doesn't look like the start of the next header line.)

(d) In a body, a bare CR will be converted to LF.
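
Rules (a) to (d) can be sketched in a few lines of Python. This is an
illustration under stated assumptions, not the code that will go into
Exim: the function name is hypothetical, the whole message is assumed to
be in memory, the header block is assumed to end at the first empty
line, and the "cleverer" test mentioned in (c) is not attempted.

```python
def normalize_line_endings(message: bytes) -> bytes:
    """Convert LF, CRLF and bare CR to LF, folding bare CRs found in headers."""
    # CRLF becomes LF first, so any CR still present is a bare CR.
    text = message.replace(b"\r\n", b"\n")

    # Assumption: headers end at the first empty line. If there is no
    # empty line, everything is treated as header text.
    head, sep, body = text.partition(b"\n\n")

    # (c) In headers, a bare CR becomes LF plus one space, which makes
    #     the following text a continuation of the same header line.
    head = head.replace(b"\r", b"\n ")

    # (d) In the body, a bare CR is simply a line ending.
    body = body.replace(b"\r", b"\n")

    return head + sep + body


msg = (b"Subject: hi\rthere\r\n"
       b"To: a@example.com\r\n"
       b"\r\n"
       b"body one\rbody two\nbody three\r\n")
print(normalize_line_endings(msg))
```

In this sketch the bare CR in the Subject header turns into a
continuation line, while the one in the body just starts a new line.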

One advantage of doing bare CR handling is that it will stop people
playing silly games such as trying to obscure text and prevent it being
displayed.

I will make a new snapshot when I have done this work.

--
Philip Hazel            University of Cambridge Computing Service,
ph10@???      Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book:    http://www.uit.co.uk/exim-book
