Re: [Exim] Retry rules/failing host

Author: Tim Jackson
Date:
To: exim-users
Subject: Re: [Exim] Retry rules/failing host

On Tue, 17 Dec 2002 11:29:03 +0000 (GMT) Philip wrote:

> > OK, understood. But messages (this isn't by any means a one-off) seem
> > to be bouncing consistently *exactly* (to the minute) two days after
> > they are received, which doesn't fit with this?
> Definitely not, especially the bit about "exactly to the minute".

Hmm, I've looked deeper and I think it was only the first message (i.e.
when the host had been failing for 4 days) which was 2 days exactly. And I
think the time of receipt in that case coincided roughly with the hourly
queue run time. Sorry about that. I can't find any other messages left in
my inbox that have done this, though I could have sworn it (and a
colleague noticed the same thing!). My eyes are going funny from looking
at too many very similar bounce messages though, or maybe the
pre-Christmas rush is taking it's toll.

> > Since I'm using a close-to-default config, delay_after_cutoff is
> > presumably true (I haven't set it false), and therefore they ought to
> > be bounced immediately, which isn't happening?
> If the host has been dead for 4 days, then yes, they ought to be bounced
> immediately.

And indeed, on closer investigation, some are - but some aren't.

I've looked into the logs/db, and AFAICT, the host in question has been
down since 1 December. Therefore, if I understand it correctly, all
messages received after ~6 December should have been bounced immediately
or fairly soon (next queue run) after, assuming the host didn't come back
in the meantime? This has happened for some messages which have been
bounced instantly on receipt, but not for others - for example, the one I
sent last night is still sitting on the queue.

> Try a delivery with debugging, so see what's going on:
> exim -d -M 18O3uf-0003Dv-00

I did this (very handy debugging, BTW!), a

[routing etc.]
...
checking status of failing.host
locking /var/spool/exim/db/retry.lockfile
locked /var/spool/exim/db/retry.lockfile
opened hints database /var/spool/exim/db/retry: flags=0
dbfn_read: key=T:failing.host:aaa.bbb.xxx.yyy
dbfn_read: key=T:failing.host:aaa.bbb.xxx.yyy:18O3uf-0003Dv-00
no message retry record
failing.host [aaa.bbb.xxx.yyy] status = usable
aaa.bbb.xxx.yyy in serialize_hosts? no (option unset)
delivering 18O3uf-0003Dv-00 to failing.host [aaa.bbb.xxx.yyy]
(user@???)
set_process_info: 16905 delivering 18O3uf-0003Dv-00 to
failing.host [aaa.bbb.xxx.yyy] (user@???)
Connecting to failing.host [aaa.bbb.xxx.yyy]:25 ... failed
LOG: MAIN
failing.host [aaa.bbb.xxx.yyy]: Connection refused
set_process_info: 16905 delivering 18O3uf-0003Dv-00: just tried
failing.host [aaa.bbb.xxx.yyy] for user@???: result DEFER
added retry item for T:failing.host:aaa.bbb.xxx.yyy: errno=111 77
flags=2
all IP addresses skipped or deferred at least one address
locking /var/spool/exim/db/wait-remote_smtp.lockfile
locked /var/spool/exim/db/wait-remote_smtp.lockfile
opened hints database /var/spool/exim/db/wait-remote_smtp: flags=42
dbfn_read: key=failing.host
Leaving remote_smtp transport
...
locking /var/spool/exim/db/retry.lockfile
locked /var/spool/exim/db/retry.lockfile
opened hints database /var/spool/exim/db/retry: flags=42
address match: subject=*@failing.host:aaa.bbb.xxx.yyy pattern=*
failing.host:aaa.bbb.xxx.yyy in "*"? yes (matched "*")
*@failing.host:aaa.bbb.xxx.yyy in "*"? yes (matched "*")
retry for T:failing.host:aaa.bbb.xxx.yyy (failing.domain) = *
dbfn_read: key=T:failing.host:aaa.bbb.xxx.yyy
Writing retry data for T:failing.host:aaa.bbb.xxx.yyy
first failed=1040048412 last try=1040163059 next try=1040184659
expired=0
error 111 77: Connection refused
dbfn_write: key=T:failing.host:aaa.bbb.xxx.yyy
end of retry processing
time on queue = 23h30m22s
warning counts: required 0 done 0
delivery deferred: update_spool=0 header_rewritten=0

Interestingly, the first failed timestamp is 14:20 on 16 December, so does
that suggest that the host has in fact been up recently? Or just that all
the retry data was "expunged"/ignored before then, and that was the first
retry after that?

> > Also, in section 31.6 you mention calculation of post-cutoff retry
> > times, but the algorithm isn't mentioned. Is it rather complicated?
> No - it just continues to use the last retry algorithm.

"Last" as in the righthand-most part of the retry rule? (in this case
F,4d,6h)? So, if I get this right, if Exim decides a host is "dead" (the
last cutoff is exceeded), and there are messages in the queue for that
host at a later time, Exim will retry at 4 hour intervals (irrelevant of
how many messages have arrived in the intervening period, assuming
delay_after_cutoff is true)?

Anyway, this is making my brain hurt now. I'm still not sure I understand
100% what's happening, but by and large things do seem to be moving in and
out of the queue, getting bounced etc. *broadly* as I would expect. I
think there may be some complicating factors I don't know about here, like
the host being intermittently accessible (as opposed to totally down,
which was my understanding), and the large number of bounce messages is
making me go cross-eyed, so I'm going to keep an eye on things and see
what happens. I have a feeling that (as I would expect) Exim is probably
doing things fundamentally right. This retry stuff is rather complicated
to track/debug what with the considerable time spans, cross-references
with the retry db, logs, bounced messages etc. Add to that the fact that I
need to fiddle with this configuration somewhat (it's the first time I've
moved to Exim for a [backup MX] host which is getting significant numbers
of deferred deliveries) and I think I'll see how things pan out.

Thanks for the help (and for the truly awesome software, which does pretty
much everything I want, although a 'MAIL' ACL would be nice so my vote's
on that wishlist item), and I'll let you know if I have any more insights!

Tim

This message is part of the following thread:
	the complete thread tree sorted by date
	Philip Hazel at
	Philip Hazel at