[Exim] Retrying after a long period of failure: opinions sought

Author: Philip Hazel
Date:
To: exim-users
Subject: [Exim] Retrying after a long period of failure: opinions sought

When a remote host has been down for a long time - longer than the final
cutoff time in its retry rule - Exim's behaviour is quite complicated.
I'm wondering if it can be simplified and so would like to know what
people think.

BACKGROUND

When I wrote Exim, I was experimenting with "host-based" retrying - that
is, basing the retry behaviour on how long the host had been down, not
on how long a failing message had been on the queue. This seems to have
worked quite well in avoiding too many unnecessary connections when
there are a number of messages for the same host.

When a host has been down for a long time and there is one more failure,
the retry data is marked "expired" and the relevant address is bounced.
The questions then are:

  (1) What to do with other messages for the same host that are already
      on the queue? Bounce them with or without retrying the host?

  (2) What about new messages that arrive later on?

Because I was worried about reducing the number of connections, I came
up with the idea of continuing to compute retry times after a host had
been down longer than its final retry time (typically 4 or 5 days).
Until the next retry time is reached, Exim bounces addresses for such
hosts without even trying to connect to them. This applies both to
existing messages and new ones. Assuming that messages for the host
continue to arrive, it tries every now and again, but only for the
occasional message.

Some people were not happy with this behaviour, and wanted more frequent
retries, so that a host's return would be detected sooner than the retry
interval (often several hours at that stage) allows. So I invented the
delay_after_cutoff option. In the default state, the
behaviour is as above, but if delay_after_cutoff is set false, Exim
always retries an expired host, as long as it has not been tried since
the message arrived.

With delay_after_cutoff set false, this means that after one message is
bounced following a long failure period, others on the queue are bounced
without trying (because they probably arrived before that last attempt).
New messages, however, are likely to try a connection before bouncing,
unless a whole lot arrive at once.
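
To make the two behaviours concrete, here is a rough sketch in Python
(not Exim's actual C code). It covers only a single host whose retry
data has already expired. The names ExpiredHost and
action_for_expired_host are made up for illustration, and times are
plain numbers in arbitrary units.

```python
from dataclasses import dataclass


@dataclass
class ExpiredHost:
    """Retry data for a host that is already past its final cutoff."""
    last_tried: float  # time of the most recent connection attempt
    next_retry: float  # next retry time, still computed after expiry


def action_for_expired_host(host, msg_arrived, now, delay_after_cutoff=True):
    """Return "try" or "bounce" for one address routed to an expired host.

    Simplified model of the behaviour described in the text only.
    """
    if delay_after_cutoff:
        # Default: bounce without connecting until the retry time is reached.
        return "try" if now >= host.next_retry else "bounce"
    # Option set false: try only if the host has not been tried since
    # this message arrived.
    return "try" if host.last_tried < msg_arrived else "bounce"
```

In this model a message that arrives just after a failed attempt is
bounced unseen in the default state, but gets one connection attempt
when delay_after_cutoff is false.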

REASONS FOR THINKING OF CHANGING

(1) Most configurations end up with long retry periods at the end (the
default configuration has 6 hours). With delay_after_cutoff left at its
default setting, this can mean that messages are bounced for a number of
hours after a host returns to life after being dead for a long time.

(2) Back in 1995 hosts were smaller and slower, and so were networks, so
saving connections mattered more than it does now. In practice, a host
being down for 4 or 5 days is pretty rare, and a few extra failed
connection attempts are not a disaster. Do we really need the
complication of the two different behaviours?

(3) This is complicated to explain, and people don't always understand
what is going on. Not many people bother to unset delay_after_cutoff.

PROPOSAL

Abolish the delay_after_cutoff option, and always retry if the host has
not been tried since the message arrived, but otherwise bounce without
attempting a connection. If there are 100 messages already on the queue,
one would be tried, and the other 99 bounced without trying (as now).
However, unless a lot came in at once, there would be one try for
messages that arrive later.
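
The queue arithmetic above can be checked with a small simulation. This
is only a sketch in Python under stated assumptions: there is a single
expired host, every connection attempt fails again, and each message is
given as a hypothetical (arrived, attempt_time) pair. The function name
apply_proposed_rule is invented for the example.

```python
def apply_proposed_rule(last_tried, deliveries):
    """Apply the proposed rule to messages for one expired host.

    last_tried  -- time of the failed attempt that expired the host
    deliveries  -- (arrived, attempt_time) pairs in processing order
    Returns a list of "try" / "bounce" outcomes, one per message.
    """
    outcomes = []
    for arrived, attempt_time in deliveries:
        if last_tried < arrived:
            # Host not tried since this message arrived: connect once.
            outcomes.append("try")
            # Assumption: the attempt fails, so only the time is updated.
            last_tried = attempt_time
        else:
            # Host already tried since arrival: bounce without connecting.
            outcomes.append("bounce")
    return outcomes


# 99 messages already queued when the host expired at time 1000.
queued = [(t, 1001) for t in range(900, 999)]
# Two later arrivals spaced apart, then a burst of three at once.
later = [(1100, 1100), (1200, 1200)]
burst = [(1300, 1301), (1300, 1301), (1300, 1301)]
results = apply_proposed_rule(1000, queued + later + burst)
```

Here all 99 queued messages bounce, each of the two spaced arrivals gets
its own attempt, and only the first message of the burst is tried.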

[When I say "abolish the delay_after_cutoff option", what I would
actually do would be to de-document it, and make it do nothing, while
retaining it in the code for a few releases to allow for compatible
upgrading.]

COMPLICATION

The complication is that an address might be routed to more than one
host, and some may have expired while others have not. Example: two MX
hosts and the primary is down for a long time, so everything is going to
the secondary. We need to ensure that the primary is tried occasionally,
so it is still necessary to compute retry times for expired hosts. The
proposed rule above applies only when *all* hosts have passed their
expiry times.
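
One way to picture the multi-host case is the sketch below, again in
Python rather than Exim's C. The HostState fields and the function
hosts_to_attempt are invented for illustration. The handling of the
mixed case (use ordinary retry times) is my reading of the paragraph
above, not a description of existing code.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    name: str
    expired: bool      # past the final cutoff of its retry rule
    last_tried: float  # time of the most recent connection attempt
    next_retry: float  # next retry time, computed even after expiry


def hosts_to_attempt(hosts, msg_arrived, now):
    """Return the names of hosts worth connecting to for one address."""
    if all(h.expired for h in hosts):
        # Proposed rule: try only hosts not tried since the message
        # arrived. An empty result here means the address is bounced.
        return [h.name for h in hosts if h.last_tried < msg_arrived]
    # Mixed case: ordinary retry times apply to every host, so an expired
    # primary is still tried once its computed retry time comes round.
    return [h.name for h in hosts if now >= h.next_retry]
```

With an expired primary and a working secondary, this returns only the
secondary until the primary's retry time is reached, and then both.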

VIEWS?

Sorry this has been so long. If you've read this far, you might have a
view. Please post it!

--
Philip Hazel            University of Cambridge Computing Service,
ph10@???      Cambridge, England. Phone: +44 1223 334714.
