[exim] Weird retry behaviour

Author: Russell King
Date:  
To: Exim Users
Subject: [exim] Weird retry behaviour
I know that 4.69 is an old version of exim, but... I'm seeing some
weird behaviour with it.

The machine in question acts as a backup machine for another computer.
It's set up so that each night it powers itself on, transfers the
data, archives it, sends a mail and powers off. Once a week, it
remains on for a 24 hour period.

The problem is this - exim behaves itself just fine when it can send
the message immediately. If it can't (because of the DSL at the site
being down) then exim gives me a hard failure and bounces the message.

This goes totally against what is in the config file for the retry
rules:

*                      *           F,2h,15m; G,16h,1h,1.5; F,4d,6h
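
To make "many days" concrete, here is a minimal Python sketch that expands the rule above into retry offsets. It assumes a simplified reading of the rule syntax: F means a fixed interval, G means an interval multiplied by the given factor after each attempt, and each sub-rule applies while the time since the first failure is below its cutoff. The helper names `parse_duration` and `retry_schedule` are made up for illustration; this is not how Exim computes retry times internally.

```python
# Hypothetical helpers; a simplified reading of the retry rule, not Exim's code.
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_duration(text):
    """Convert a single-unit duration such as '15m' or '4d' to seconds."""
    return int(text[:-1]) * UNITS[text[-1]]


def retry_schedule(rule):
    """Return retry offsets, in seconds since the first failure."""
    offsets = []
    now = 0.0
    for part in rule.split(";"):
        fields = [f.strip() for f in part.split(",")]
        kind = fields[0]
        cutoff = parse_duration(fields[1])
        step = float(parse_duration(fields[2]))
        # Only G sub-rules carry a growth factor; F keeps the interval fixed.
        factor = float(fields[3]) if kind == "G" else 1.0
        # Assumption: a sub-rule is used while the elapsed time is below its cutoff.
        while now < cutoff:
            now += step
            offsets.append(now)
            step *= factor
    return offsets


schedule = retry_schedule("F,2h,15m; G,16h,1h,1.5; F,4d,6h")
print(len(schedule), "retries; last one",
      schedule[-1] / 86400, "days after the first failure")
```

Under these assumptions the last scheduled retry falls a little over four days after the first failure, which is the timescale the rule should allow before a bounce.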


The config file is pretty much standard Fedora 14 (the retry line above
is the F14 default), but with this as the remote_smtp transport:

remote_smtp:
driver = smtp
headers_rewrite = *@* hidden@??? fs
return_path = hidden@???


So it should take many days before bouncing. However:

2014-07-26 07:01:42 1XAv38-0000iU-Ln <= backup@??? U=backup P=local S=65027 id=20140726060142.GA2756@shgc-backup
2014-07-26 07:01:48 1XAv38-0000iU-Ln => rmk@??? R=dnslookup T=remote_smtp H=mx0.arm.linux.org.uk [78.32.30.218] X=TLSv1:AES256-SHA:256

that one was fine. Then this morning:

2014-07-27 04:19:35 1XBEzn-0000XA-FM <= root@??? U=root P=local S=3340
2014-07-27 04:20:17 1XBEzn-0000XA-FM == rmk@??? <root@???> R=dnslookup defer (-1): host lookup did not complete
2014-07-27 04:21:22 1XBEzn-0000XA-FM == rmk@??? <root@???> routing defer (-51): retry time not reached
2014-07-27 04:26:21 1XBEzn-0000XA-FM == rmk@??? <root@???> routing defer (-51): retry time not reached
2014-07-27 04:31:23 1XBEzn-0000XA-FM == rmk@??? <root@???> routing defer (-51): retry time not reached
2014-07-27 04:36:42 1XBEzn-0000XA-FM == rmk@??? <root@???> R=dnslookup defer (-1): host lookup did not complete
...
2014-07-27 05:16:42 1XBEzn-0000XA-FM == rmk@??? <root@???> R=dnslookup defer (-1): host lookup did not complete
...
2014-07-27 05:36:42 1XBEzn-0000XA-FM == rmk@??? <root@???> R=dnslookup defer (-1): host lookup did not complete
...
2014-07-27 05:56:41 1XBEzn-0000XA-FM == rmk@??? <root@???> R=dnslookup defer (-1): host lookup did not complete
...
2014-07-27 06:16:41 1XBEzn-0000XA-FM == rmk@??? <root@???> R=dnslookup defer (-1): host lookup did not complete
2014-07-27 06:36:41 1XBEzn-0000XA-FM == rmk@??? <root@???> R=dnslookup defer (-1): host lookup did not complete

2014-07-27 06:44:43 1XBHGF-0000iX-Rm <= backup@??? U=backup P=local S=63350 id=20140727054423.GA2759@shgc-backup
2014-07-27 06:45:24 1XBHGF-0000iX-Rm == rmk@??? R=dnslookup defer (-1): host lookup did not complete
2014-07-27 06:46:19 1XBEzn-0000XA-FM == rmk@??? <root@???> routing defer (-51): retry time not reached
2014-07-27 06:46:19 1XBHGF-0000iX-Rm == rmk@??? routing defer (-51): retry time not reached
...
2014-07-27 07:46:39 1XBEzn-0000XA-FM == rmk@??? <root@???> R=dnslookup defer (-1): host lookup did not complete
2014-07-27 07:46:39 1XBHGF-0000iX-Rm == rmk@??? routing defer (-51): retry time not reached
...
2014-07-27 08:46:19 1XBEzn-0000XA-FM == rmk@??? <root@???> routing defer (-51): retry time not reached
2014-07-27 08:46:19 1XBHGF-0000iX-Rm == rmk@??? routing defer (-51): retry time not reached
...
2014-07-27 09:21:40 1XBEzn-0000XA-FM == rmk@??? <root@???> R=dnslookup defer (-1): host lookup did not complete
2014-07-27 09:21:40 1XBHGF-0000iX-Rm == rmk@??? routing defer (-51): retry time not reached
...
2014-07-27 11:40:59 1XBHGF-0000iX-Rm mx0.arm.linux.org.uk [2002:4e20:1eda:1:214:fdff:fe10:1be6] Network is unreachable
2014-07-27 11:40:59 1XBHGF-0000iX-Rm mx0.arm.linux.org.uk [2001:4d48:ad52:3201:214:fdff:fe10:1be6] Network is unreachable
2014-07-27 11:41:04 1XBHGF-0000iX-Rm => rmk@??? R=dnslookup T=remote_smtp H=mx0.arm.linux.org.uk [78.32.30.218] X=TLSv1:AES256-SHA:256
2014-07-27 11:41:04 1XBHGF-0000iX-Rm Completed
2014-07-27 11:41:19 1XBEzn-0000XA-FM ** rmk@??? <root@???> R=dnslookup T=remote_smtp: retry time not reached for any host after a long failure period
2014-07-27 11:41:19 1XBLtH-0000qh-O9 <= <> R=1XBEzn-0000XA-FM U=exim P=local S=4383
2014-07-27 11:41:19 1XBEzn-0000XA-FM Completed
2014-07-27 11:41:21 1XBLtH-0000qh-O9 => rmk@??? <root@???> R=dnslookup T=remote_smtp H=mx0.arm.linux.org.uk [78.32.30.218] X=TLSv1:AES256-SHA:256

So, at 11:41:04, exim found that the destination was deliverable
again. However, it decided to time out the 1XBEzn-0000XA-FM
message _before_ the retry rules stated that it should time out, and
sent a non-delivery report... which it also successfully delivered to
the same destination!
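
As a quick check on "before the retry rules stated that it should time out", this small Python sketch computes how long 1XBEzn-0000XA-FM was on the queue, using the arrival and bounce timestamps from the log above, and compares it with the 4d cutoff of the last retry rule. The variable names are illustrative only.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the "<=" and "**" log lines for 1XBEzn-0000XA-FM.
arrived = datetime.strptime("2014-07-27 04:19:35", FMT)
bounced = datetime.strptime("2014-07-27 11:41:19", FMT)

queued = bounced - arrived
final_cutoff = timedelta(days=4)  # cutoff of the last sub-rule, F,4d,6h

print("time on queue:", queued)
print("shorter than the 4d cutoff:", queued < final_cutoff)
```

The message was on the queue for well under a day when it was bounced.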

The wait-remote_smtp database is empty.

The two most recent retry database entries are:

26-Dec-2013 03:06:32 27-Jul-2014 11:41:04 27-Jul-2014 17:41:04 *
T:mx0.arm.linux.org.uk:2002:4e20:1eda:1:214:fdff:fe10:1be6 101 77 Network is unreachable
24-Dec-2013 03:06:23 27-Jul-2014 11:41:04 27-Jul-2014 17:41:04 *
T:pandora.arm.linux.org.uk:2002:4e20:1eda:1:214:fdff:fe10:1be6 101 77 Network is unreachable

which are expected, as the site running this exim has no IPv6 connectivity
with which to reach the IPv6 addresses I have here. The only entry for the
IPv4 address is an old one which should have expired long ago (and the
DNS changed since then):

13-Feb-2014 05:26:39 13-Feb-2014 05:26:39 13-Feb-2014 05:41:39
T:caramon.arm.linux.org.uk:78.32.30.218 110 333 Connection timed out

Indeed, having tidied the retry database, the only two entries which
remain are the two above.
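
For reference, the age of those two IPv6 records can be worked out from their timestamps. The sketch below assumes the three times on each line are first failure, last attempt and next attempt, which is how the exim_dumpdb documentation describes retry records; the same documentation says a trailing asterisk marks a record whose last retry rule cutoff has been exceeded. Whether that is what triggers the early bounce is not established here. The function name `failure_span` is made up for illustration.

```python
from datetime import datetime, timedelta

# %b parses English month abbreviations in the default C/English locale.
FMT = "%d-%b-%Y %H:%M:%S"


def failure_span(first_failed, last_tried):
    """Time between a record's first failure and its last delivery attempt."""
    return datetime.strptime(last_tried, FMT) - datetime.strptime(first_failed, FMT)


# (first failure, last attempt) pairs copied from the two records above.
records = [
    ("26-Dec-2013 03:06:32", "27-Jul-2014 11:41:04"),
    ("24-Dec-2013 03:06:23", "27-Jul-2014 11:41:04"),
]

final_cutoff = timedelta(days=4)  # cutoff of the last sub-rule, F,4d,6h
for first, last in records:
    span = failure_span(first, last)
    print(first, "->", last, ":", span.days, "days; past 4d cutoff:",
          span > final_cutoff)
```

Both records span several months between first failure and last attempt, far longer than the 4d cutoff.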

The DNS for the machine is configured to use Google's DNS servers
(iow, 8.8.8.8 and 8.8.4.4) as I've had problems with the ISP's DNS
servers - so DNS would have been unavailable during the loss of
connectivity too.

So, the question is whether there's something screwed with the config
file, or whether it's just this old exim version misbehaving (which I
suspect is the real problem here.) What I don't understand is why the
successful delivery of 1XBHGF-0000iX-Rm seemed to cause 1XBEzn-0000XA-FM
to be immediately bounced.

This probably isn't an issue that I can reproduce at will; I've seen it
a number of times, and it's always triggered by the loss of connectivity
at the site.

--
Russell King