Re: [exim] Weird retry behaviour

Author: Todd Lyons
Date:
To: Russell King
CC: Exim Users
Subject: Re: [exim] Weird retry behaviour

On Sun, Jul 27, 2014 at 8:02 AM, Russell King <rmk+exim@???> wrote:
> I know that 4.69 is an old version of exim, but... I'm seeing some
> weird behaviour with it.

TL;DR, I agree with your hunch that it's an old-exim-version problem.
A fair amount of things have been touched WRT retry logic since 4.69.

> The machine in question acts as a backup machine for another computer.
> It's setup such that each night, it powers itself on, transfers the
> data, archives it, sends a mail and powers off. Once a week, it
> remains on for a 24 hour period.

Can you verify the clock accuracy during the duration of each of the
periods? Running ntp on it? Setting the time before exim starts?

> The problem is this - exim behaves itself just fine when it can send
> the message immediately. If it can't (because of the DSL at the site
> being down) then exim gives me a hard failure and bounces the message.
>
> This goes totally against what is in the config file for the retry
> rules:
>
> * * F,2h,15m; G,16h,1h,1.5; F,4d,6h

Totally standard line, nothing complicated.

> The config file is pretty much standard Fedora 14, but with these as
> the routers (as is the above line being the F14 default):
>
> remote_smtp:
> driver = smtp
> headers_rewrite = *@* hidden@??? fs
> return_path = hidden@???
>
>
> So it should take many days before bouncing. However:
>
> 2014-07-26 07:01:42 1XAv38-0000iU-Ln <= backup@??? U=backup P=local S=65027 id=20140726060142.GA2756@shgc-backup
> 2014-07-26 07:01:48 1XAv38-0000iU-Ln => rmk@??? R=dnslookup T=remote_smtp H=mx0.arm.linux.org.uk [78.32.30.218] X=TLSv1:AES256-SHA:256
>
> that one was fine. Then this morning:
>
> 2014-07-27 04:19:35 1XBEzn-0000XA-FM <= root@??? U=root P=local S=3340
> 2014-07-27 04:20:17 1XBEzn-0000XA-FM == rmk@??? <root@???> R=dnslookup defer (-1): host lookup did not complete
> 2014-07-27 04:21:22 1XBEzn-0000XA-FM == rmk@??? <root@???> routing defer (-51): retry time not reached
> 2014-07-27 04:26:21 1XBEzn-0000XA-FM == rmk@??? <root@???> routing defer (-51): retry time not reached
> 2014-07-27 04:31:23 1XBEzn-0000XA-FM == rmk@??? <root@???> routing defer (-51): retry time not reached
> 2014-07-27 04:36:42 1XBEzn-0000XA-FM == rmk@??? <root@???> R=dnslookup defer (-1): host lookup did not complete

I want to make a casual observation here that it doesn't seem to be
complaining that it couldn't connect to the host to transfer mail.
It's saying it couldn't connect to the DNS server to do the actual DNS
lookup. I think that distinction means that it is not yet using the
retry logic.

Ok, here's a new email getting sent, 2.5 hours later.

> 2014-07-27 06:44:43 1XBHGF-0000iX-Rm <= backup@??? U=backup P=local S=63350 id=20140727054423.GA2759@shgc-backup
> 2014-07-27 06:45:24 1XBHGF-0000iX-Rm == rmk@??? R=dnslookup defer (-1): host lookup did not complete
> 2014-07-27 06:46:19 1XBEzn-0000XA-FM == rmk@??? <root@???> routing defer (-51): retry time not reached
> 2014-07-27 06:46:19 1XBHGF-0000iX-Rm == rmk@??? routing defer (-51): retry time not reached

Both of them are being skipped for routing because there is an entry
in the routing tables that gives a status. My question is whether
that status actually is that it couldn't complete the DNS lookup, and
whether that status should be preventing subsequent route processing.
If I'm interpreting the message incorrectly, then what I've just
written above is not valid. (just a little disclaimer :-)

> 2014-07-27 09:21:40 1XBEzn-0000XA-FM == rmk@??? <root@???> R=dnslookup defer (-1): host lookup did not complete
> 2014-07-27 09:21:40 1XBHGF-0000iX-Rm == rmk@??? routing defer (-51): retry time not reached

That is probably still the cached result "host lookup did not
complete". Here is where it gets interesting:

> 2014-07-27 11:40:59 1XBHGF-0000iX-Rm mx0.arm.linux.org.uk [2002:4e20:1eda:1:214:fdff:fe10:1be6] Network is unreachable
> 2014-07-27 11:40:59 1XBHGF-0000iX-Rm mx0.arm.linux.org.uk [2001:4d48:ad52:3201:214:fdff:fe10:1be6] Network is unreachable
> 2014-07-27 11:41:04 1XBHGF-0000iX-Rm => rmk@??? R=dnslookup T=remote_smtp H=mx0.arm.linux.org.uk [78.32.30.218] X=TLSv1:AES256-SHA:256
> 2014-07-27 11:41:04 1XBHGF-0000iX-Rm Completed

Obviously the DNS lookup completed because it figured out where to
send the email. It correctly determined that the ipv6 transport
couldn't be used and did a fallback to ipv4. Note for the 4.83
release there was some issue with the fact that ipv6 is always
preferred internally, even if the outgoing IP is specified to be an
ipv4 address. It's not a problem for you though since it correctly
falls back to ipv4.

Note the 15 second delay. I find that interesting. Moving on:

> 2014-07-27 11:41:19 1XBEzn-0000XA-FM ** rmk@??? <root@???> R=dnslookup T=remote_smtp: retry time not reached for any host after a long failure period

It seems to me that at this point, it should have known that it was ok too.
> So, at 11:41:04, exim found that the destination was now able to be
> delivered to. However, it decided to time out the 1XBEzn-0000XA-FM
> message _before_ the retry rules stated that it should time out, and
> sent a non-delivery report... which it also successfully delivered to
> the same destination!

I looked through the ChangeLog to see if there was something that
sounded like it addressed this. I found the following in the git
history that seemed slightly relevant:

4.82 release:     Fix ultimate retry timeouts for intermittently
deliverable recipients.

...which sounds very strongly like it was a bug that was fixed in 4.82
last year.

> The DNS for the machine is configured to use google's DNS servers
> (iow, 8.8.8.8 and 8.8.4.4) as I've had problems with the ISPs DNS
> servers - so DNS would have been unavailable during the loss of
> connectivity too.

It took me a few reads to grok this. The intermittent connectivity is
the DSL line that the host running Exim 4.69 uses for internet
connectivity. So that's the reason for the DNS failures mentioned
above.

> So, the question is whether there's something screwed with the config
> file, or whether it's just this old exim version misbehaving (which I
> suspect is the real problem here.) What I don't understand is why the

I concur.

> successful delivery of 1XBHGF-0000iX-Rm seemed to cause 1XBEzn-0000XA-FM
> to be immediately bounced.

...which I suspect is fixed by Tony's patch above that went into 4.82.

> This probably isn't an issue that I can reproduce at will; I've seen it
> a number of times, and it's always triggered by the loss of connectivity
> at the site.

Is there any way to test a newer Exim version on that host?

...Todd
--
The total budget at all receivers for solving senders' problems is $0.
If you want them to accept your mail and manage it the way you want,
send it the way the spec says to. --John Levine

This message is part of the following thread:
	the complete thread tree sorted by date
	Russell King at
	Dan Carroll at