[exim] Diagnosing delay in retrying

Author: Paul Warren
Date:  
To: Exim Mailing List
Subject: [exim] Diagnosing delay in retrying
We have three servers. One server generates a lot of mail and uses a
pair of servers as a smart host. The two servers are addressed under
the same name (mx.mythic-beasts.com), so the config on the sending
server looks like this:

smarthost:
driver = manualroute
route_data = mx.mythic-beasts.com
transport = remote_smtp

where:

$ host mx.mythic-beasts.com
mx.mythic-beasts.com has address 93.93.131.52
mx.mythic-beasts.com has address 93.93.130.6

Every now and again, exim on the sending server stops delivering and
starts queuing mail. Looking at the logs, this appears to be triggered
by a connection timeout:

2009-09-29 20:26:59 1MsiIf-0002cW-Ps == a@??? R=smarthost T=remote_smtp defer (110): Connection timed out

That is then followed by many deferrals in which no attempt is made:

2009-09-29 20:26:59 1MsiLb-0003f7-DW == b@??? R=smarthost T=remote_smtp defer (-53): retry time not reached for any host

Exim then appears to refuse to retry for an unreasonably long period
of time. For example, exim successfully sends a mail at 20:54. It
then logs a number of timeouts until 20:58. After that, it does not
appear to retry until 04:57 the following morning, despite logging
"defer (-53): retry time not reached for any host" many times every
minute for the whole of that period.
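
To measure gaps like this from the main log, a minimal Python sketch
along the following lines could separate real delivery attempts from the
"retry time not reached" entries. It assumes each log entry is on a
single line in the format shown above; the names real_attempts and gaps
are made up for illustration.

```python
import re
from datetime import datetime

# Matches deferral lines such as:
# 2009-09-29 20:26:59 1MsiIf-0002cW-Ps == a@??? R=smarthost T=remote_smtp defer (110): Connection timed out
DEFER_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \S+ == .*? defer \((-?\d+)\): (.*)$"
)


def real_attempts(lines):
    """Return (timestamp, code, text) for deferrals where an attempt was made."""
    attempts = []
    for line in lines:
        match = DEFER_RE.match(line.strip())
        if not match:
            continue
        code = int(match.group(2))
        if code == -53:
            # "retry time not reached for any host": nothing was attempted
            continue
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        attempts.append((when, code, match.group(3)))
    return attempts


def gaps(attempts):
    """Return the time between each pair of consecutive real attempts."""
    times = [attempt[0] for attempt in attempts]
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

Feeding it the lines of the main log (for example via open(path)) and
printing gaps(real_attempts(lines)) gives the spacing of the attempts
that actually reached the network.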

Our retry configuration says:

begin retry

# Only retry bounce delivery once every 12 hours, for 4 days.
*                      *                senders=:           F,4d,12h


# Everything else, try once every 15 minutes for 12 hours, then once an hour,
# increasing by 150% each time, for 16 hours; then once every 8 hours for 4
# days.
*                      *                                    F,12h,15m; G,16h,1h,1.5; F,4d,8h
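
As a rough illustration of how the second rule might expand into retry
times, here is a simplified Python sketch. It assumes each cutoff is
measured from the first failure, that an item applies while the elapsed
time is within its cutoff, and that a G item starts at its interval and
multiplies it by the factor each time. It is a model for reasoning about
the rule, not Exim's actual algorithm, and the function names are
invented.

```python
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def seconds(text):
    """Convert a single-unit time such as '15m' or '4d' to seconds."""
    return int(text[:-1]) * UNITS[text[-1]]


def parse_rule(rule):
    """Split 'F,12h,15m; G,16h,1h,1.5' into (letter, cutoff, interval, factor)."""
    items = []
    for part in rule.split(";"):
        fields = [field.strip() for field in part.split(",")]
        factor = float(fields[3]) if fields[0] == "G" else 1.0
        items.append((fields[0], seconds(fields[1]), seconds(fields[2]), factor))
    return items


def schedule(rule):
    """Return retry times in seconds since the first failure (simplified)."""
    elapsed = 0
    times = []
    for letter, cutoff, interval, factor in parse_rule(rule):
        gap = interval
        # Assumption: this item is used while elapsed time is within its cutoff.
        while elapsed <= cutoff:
            elapsed += gap
            times.append(int(elapsed))
            if letter == "G":
                gap = gap * factor  # geometric items grow; fixed items do not
    return times


if __name__ == "__main__":
    for t in schedule("F,12h,15m; G,16h,1h,1.5; F,4d,8h"):
        print(f"{t // 3600:3d}h{(t % 3600) // 60:02d}m")
```

Under this reading, once more than 16 hours have passed since the first
recorded failure, attempts are 8 hours apart.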


A couple of questions:

1. Why doesn't it retry during that 8-hour period? Surely the
successful send at 20:54 should reset the retry rules?

2. Does setting route_data to a hostname with multiple A records achieve
the redundancy I'm looking for? As far as I can tell, exim makes no
attempt to fall back on the second IP after the connection failure: it
had not seen a connection failure on the other IP for around 3 hours
before going into "won't send any mail" mode.
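
To make the expected fallback concrete, here is a minimal Python sketch
of "try each address in turn". It illustrates the behaviour being asked
about, not what Exim's manualroute router or smtp transport actually
does; connect_first_available and its parameters are hypothetical.

```python
import socket


def connect_first_available(addresses, port=25, timeout=30.0,
                            connect=socket.create_connection):
    """Try each address in order; return (address, connection) on first success."""
    errors = []
    for address in addresses:
        try:
            return address, connect((address, port), timeout=timeout)
        except OSError as exc:
            # Timeouts and refusals are both OSError subclasses; note and move on.
            errors.append((address, exc))
    raise OSError(f"all addresses failed: {errors}")
```

Called with ["93.93.131.52", "93.93.130.6"], a timeout on the first
address would lead to an attempt on the second rather than an immediate
deferral.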

I'm separately trying to get to the bottom of why we're seeing the
connection timeouts in the first place, but I'd like to understand why
our setup isn't as robust as I think it should be.

Many thanks,

Paul