Author: Dean Brooks
Date:
To: Philip Hazel
CC: exim-users
Subject: Re: [Exim] Retrying after a long period of failure: opinions sought

On Thu, Dec 19, 2002 at 11:12:47AM +0000, Philip Hazel wrote:
> Because I was worried about reducing the number of connections, I came
> up with the idea of continuing to compute retry times after a host had
> been down longer than its final retry time (typically 4 or 5 days).
>
> Abolish the delay_after_cutoff option, and always retry if the host has
> not been tried since the message arrived, but otherwise bounce without
> attempting a connection. If there are 100 messages already on the queue,
> one would be tried, and the other 99 bounced without trying (as now).
> However, unless a lot came in at once, there would be one try for
> messages that arrive later.

Indeed, this was a confusing area of Exim for me, so my view of the
option may not be directly on target. However, some feedback regarding
the number of connections may be relevant.

We run a fairly busy ISP Exim server (about 800,000 messages per day)
and consistently our top priority is reducing simultaneous TCP/IP
connections. Having too many connections results in too many
processes (in Exim's forking model), and so processes are considered a
scarce resource, even if the upper limit is 200 or so per machine.

Clamping down on maximum deliveries works, but at the expense of
growing the queue and slowing down delivery of mail. This isn't much
of a problem for us now, though, since Exim holds down retries very
nicely.

However, what if a stream of 10,000 messages comes in for a long-term
failed host? Under the new proposal, will each message be retried
immediately upon receipt even if the IP was confirmed to be down 2
minutes ago? If so, that would seem extremely wasteful of our
precious resource, especially given that connect timeouts may be on
the order of minutes per message.

If a message is past its long retry time, could the final retry
attempt be forced to occur in a queue runner (i.e. the message is put
on the queue without an immediate delivery attempt)? At least then the
queue runner could attempt to route a single message and then
batch-fail the rest immediately, rather than trying to connect for
every single new incoming message.
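
To make the comparison concrete, here is a rough sketch in Python that
counts connection attempts under the two approaches. It is only a toy
model under my own assumptions, not Exim code: the function names, the
fixed queue-run interval, and the premise that the host stays down the
whole time are all hypothetical.

```python
def attempts_try_on_arrival(arrival_times):
    """Count connection attempts if a message is tried on arrival whenever
    the host has not been tried since that message arrived.

    Assumption: the host is down for the whole period, and delivery
    processes run in parallel, so a slow attempt does not block later ones.
    """
    last_try = None  # time of the most recent connection attempt
    attempts = 0
    for t in sorted(arrival_times):
        # Messages arriving at the same instant share a single attempt;
        # any message arriving later triggers a fresh one.
        if last_try is None or last_try < t:
            attempts += 1
            last_try = t
    return attempts


def attempts_queue_runner(arrival_times, runner_interval):
    """Count attempts if new messages are only queued, and each queue run
    that finds waiting mail makes one attempt and then fails the rest.

    Assumption: queue runs happen at a fixed interval (in seconds), and a
    message is handled by the run that closes the interval it arrived in.
    """
    runs_with_mail = {int(t // runner_interval) for t in arrival_times}
    return len(runs_with_mail)
```

For example, with one message arriving per second, calling
attempts_try_on_arrival(range(10000)) counts one attempt per message,
while attempts_queue_runner(range(10000), 1800) counts one attempt per
30-minute queue run that had mail waiting.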