Re: [exim] Weird retry behaviour

Author: Dan Carroll
Date:  
To: Russell King
CC: Exim Users
Subject: Re: [exim] Weird retry behaviour

On 31 Jul 2014, at 8:31 am, Russell King <rmk+exim@???> wrote:

> On Thu, Jul 31, 2014 at 07:27:14AM +1000, Dan Carroll wrote:
>> Have a look at the retry and wait spool hint database entries.
>
> I stated the contents of the retry and wait databases. That's very simple
> because the machine /only/ talks to my MTA here (to notify me of its
> operation each and every night.)
>
> The previous nights messages were delivered fine and in a timely manner.
>


Sorry, I missed that.

> It's not that /my/ host is down (did you read my message?) As I explained
> in my message, it's that the link to the site running this exim was down
> for just four hours, which was enough to trigger exim into bouncing the
> message *inspite* of the retry rules specifying days of retries, for a
> host which had been *known* to exim to be available less than 24 hours
> previously.


From exim’s point of view, the remote host was down. That’s what I meant (and yes, despite missing the db info you posted, I did read your message).
Given that info, I agree, it is weird retry behaviour.

>
>
> I'm well aware (being a 10+ year user) of exim's retry behaviour in the
> presence of multiple messages for a target host (that the timeout runs
> per target host, not per message), and the tools to manage exim.
>


Then I am not sure why you are surprised that it would not attempt delivery of a message to an MX whose IP matches a retry entry stating that no mail has been delivered to this host since February...

>> You can also delete the retry and wait* databases to reset everything….
>> (I’d probably offline exim while I did that but it’s likely not necessary).
>
> And using exim_tidydb (as I said I have done) and then I quoted the only
> two remaining entries in the retry database…
>


The fact that you see this in your logs:
"retry time not reached for any host after a long failure period"

means that exim considered the host to have been down for a long time. That information, as I am sure you realise, comes from the retry/wait DBs.

You tidied up an old record from the MX of the domain you are trying to send to, and then the retry succeeded.
My guess is that the retry code matched the old DB entry (makes sense, the IP address is the same even if the hostname is different), which for some reason was not removed.

Another guess; perhaps it went like this:
An entry is added in February when the host is offline.
The host's DNS changes (as you have stated it did).
Now caramon.* does not exist, but the MX for the domain is mx0.*
mx0.* has the same IP address as caramon.*
For some reason, caramon never gets cleaned (maybe exim does not clean up hosts when the hostname does not match or won’t resolve?)
I’m not sure how db entries are cleaned. Perhaps after a successful connection, exim removes the matching entry (host+ip), which in your case would mean that it would never remove caramon from the list.
You could test this theory by adding the entry back into the DB, creating a host entry for caramon.* and then forcing delivery to caramon.* via a customised router. If I’m right, then the old entry would disappear. That seems like a lot of work to test a theory, however.

Then we try to deliver some mail in July.
The retry code matches the host caramon in the retry DB because it only looks at the IP address.
So it does not even bother to retry. To exim, it seems all retries have been exhausted.
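
To make that guess concrete, here is a toy model in Python. It is not Exim's code and I have not checked it against the source. It only assumes what I guessed above: entries are stored per (hostname, IP), a successful delivery clears only the exact (hostname, IP) pair, and the retry check looks at the IP alone. The hostnames, IP address and function names are made-up placeholders.

```python
DAY = 86400  # seconds

def find_stale_by_ip(retry_db, ip, now, cutoff):
    """Return keys whose IP matches and whose first failure is older than cutoff.

    retry_db maps (hostname, ip) -> time of first failure, in seconds.
    This models the guess that the retry check ignores the hostname.
    """
    return [key for key, first_failed in retry_db.items()
            if key[1] == ip and now - first_failed > cutoff]

def clear_on_success(retry_db, host, ip):
    """Model the guess that a success removes only the exact (host, ip) entry."""
    retry_db.pop((host, ip), None)

# Entry created long ago under the old hostname.
retry_db = {("old-name.example", "192.0.2.10"): 40 * DAY}

# Later deliveries succeed under the new hostname, same IP: nothing is removed.
clear_on_success(retry_db, "new-name.example", "192.0.2.10")

# After a short outage, the IP-only check still finds the old entry.
stale = find_stale_by_ip(retry_db, "192.0.2.10", now=200 * DAY, cutoff=4 * DAY)
```

If the real code behaves anything like this, the old entry would survive every successful delivery under the new name and still count as a long failure the next time the link drops.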


I suspect if you left it alone, it would never have retried.
But you cleaned it and now the DBs are working again.

To stop the problem from happening again:

Read items 11 and 13 of this: http://www.exim.org/exim-html-current/doc/html/spec_html/ch-some_common_configuration_settings.html
Nothing there makes it work *much* better, unfortunately. Exim was designed to be used on permanently connected hosts.

delay_after_cutoff might be interesting for you, but if the message arrived when the internet connection was down, then the bounce would have come anyway.

About the only other thing I could suggest is to monitor your DBs for more issues.
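
If you want to automate that, here is a rough Python sketch that reads the text printed by exim_dumpdb for the retry database and flags entries whose first failure is older than some limit. It assumes the two-line record layout described in the Exim spec (a key line such as "T:host:ip ...", then a line with three timestamps for first failure, last attempt and next attempt, ending in "*" once the final retry cutoff has passed). The function names and the 7-day default are my own choices, so check the layout against your own dump output before relying on it.

```python
import re
from datetime import datetime, timedelta

# Timestamps in the dump look like "31-Oct-1995 12:00:12" (assumed layout).
STAMP = re.compile(r"\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}")
STAMP_FORMAT = "%d-%b-%Y %H:%M:%S"

def parse_retry_dump(text):
    """Parse retry-database dump text into a list of record dicts."""
    records = []
    key = None
    for line in text.splitlines():
        stamps = STAMP.findall(line)
        if len(stamps) == 3 and key is not None:
            first, last, nxt = (datetime.strptime(s, STAMP_FORMAT) for s in stamps)
            records.append({
                "key": key,
                "first_failed": first,
                "last_try": last,
                "next_try": nxt,
                # A trailing "*" marks a record past its final retry cutoff.
                "expired": line.rstrip().endswith("*"),
            })
            key = None
        elif line.strip():
            # Any other non-blank line is treated as a record's key line.
            key = line.split()[0]
    return records

def stale_records(records, now, max_age=timedelta(days=7)):
    """Return records whose first failure is more than max_age before now."""
    return [r for r in records if now - r["first_failed"] > max_age]
```

You would feed it the output of something like "exim_dumpdb /var/spool/exim retry" (the spool path is a guess for your system) from a nightly cron job and mail yourself anything that stale_records returns.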

Good luck…
-D