Re: [Exim] [exim 3.31] strange retry behavior

Author: Philip Hazel
Date:
To: Marc Haber
CC: exim-users
Subject: Re: [Exim] [exim 3.31] strange retry behavior

On Tue, 30 Jul 2002, Marc Haber wrote:

> >No, as I said, it isn't quite like that. Let me quote the manual to you
> >again:
>
> Quoting what I have read, read and read doesn't help. There are
> non-native speakers who don't have your command of your native
> language and who sometimes happen to see a second way of interpreting
> what you wrote with only one interpretation in mind.

Sorry, I wasn't meaning to imply that you hadn't read it. This is such a
tricky point that I wanted to be sure that the relevant bit of the
manual was quoted in the message, for future reference.

I'm aware that I have a great advantage in being able to do my job
entirely in my native language. I'm also aware that, however hard I try,
I'm sure to write stuff that is ambiguous, or hard to understand, even
to native English speakers. When people point out examples, I try to
correct them.

You are right. That particular paragraph is terse in the extreme. I have
made a note to expand it and give more explanation in the next edition.
I don't think that this is an area of Exim that many people are
interested in, which is probably why nobody has questioned it before.
Mostly, people just have a default retry rule.

> But that does not explain what my exim is doing. Let's go back to my
> original posting.

Good idea. I have forgotten the original question!

> example.com is listed in relay_domains; there is no entry for
> mx.otherprovider in any file.
>
> Hence, I'd expect the following to happen:
>
> - Mail for foo@??? comes in
> - exim tries to deliver to mx.otherprovider. Doesn't succeed (timeout)
> - exim queues the message

If you want to be strictly accurate, it isn't quite like that.

  - Mail for foo@??? comes in
  - the message is put on the queue
  - a first delivery process is started for the message
  - exim tries to deliver to mx.otherprovider - there's a timeout
  - the message is left on the queue, with a retry time computed from
    the retry rules.

> - on next delivery attempt, exim starts looking in the retry config

No, it isn't like that. The retry config is inspected at the end of the
failed delivery, not at the start of the next delivery. Exim computes a
retry time for the failing host.

> - look for mx.otherprovider in long_queue_domains. don't find it.
> - look for example.com in long_queue_Domains. don't find it either.
> - look for mx.otherprovider in relay_domains. don't find it.
> - look for example.com in relay_domains. Find it.

So those operations would have happened at the end of the previous
delivery, to compute a retry time for the host mx.otherprovider. Let me
remind myself of your retry rules:

partial-lsearch;CONFDIR/long_queue_domains      *       F,2h,15m; F,14d,2h
partial-lsearch;CONFDIR/relay_domains   *       F,2h,15m; F,5d,2h
*                                       *       F,2h,15m; G,16h,1h,1.5; F,2d,8h

OK, what you said above makes sense. (I assume that when you said
"example.com is listed in relay_domains; there is no entry for
mx.otherprovider in any file" you meant that the file
CONFDIR/relay_domains contains "example.com" as one of its lines.)
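
For future reference, here is a rough Python sketch of how retry parameters of this form can be read. It is a simplified model and not Exim's own code: the names parse_time, parse_retry and next_interval are made up for illustration, only single-unit times such as 15m or 2h are handled, and the treatment of G rules (previous interval times the factor, but never less than the starting interval) is an assumption based on the manual's description.

```python
import re

# Seconds per unit letter used in retry rules.
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_time(text):
    """Convert a single-unit value such as '15m' or '2h' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdw])", text.strip())
    if match is None:
        raise ValueError(f"unsupported time value: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def parse_retry(spec):
    """Split 'F,2h,15m; F,5d,2h' into (letter, cutoff, interval, factor) tuples."""
    rules = []
    for part in spec.split(";"):
        fields = [field.strip() for field in part.split(",")]
        letter = fields[0]
        cutoff = parse_time(fields[1])
        interval = parse_time(fields[2])
        factor = float(fields[3]) if letter == "G" else 1.0
        rules.append((letter, cutoff, interval, factor))
    return rules


def next_interval(rules, down_for, previous=0):
    """Return (seconds until next try, past_final_cutoff).

    down_for is the time in seconds since the host FIRST failed,
    not the age of any particular message.
    """
    for letter, cutoff, interval, factor in rules:
        if down_for < cutoff:
            if letter == "G":
                # Assumed model: grow the previous interval geometrically.
                return max(interval, int(previous * factor)), False
            return interval, False
    # Every cutoff has passed: reuse the last interval and flag the timeout.
    return rules[-1][2], True


relay_rule = parse_retry("F,2h,15m; F,5d,2h")
print(next_interval(relay_rule, 3600))       # host down for 1 hour
print(next_interval(relay_rule, 64 * 3600))  # host down for 64 hours
```

With the relay_domains rule, a host that has been down for less than 2 hours gets 15-minute retries, and after that 2-hour retries until the 5-day cutoff.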

> - use queue settings F,2h,15m; F,5d,2h, specifying to keep the message
> for five days.

No. That rule specifies to keep the message until the host has been down
for five days. It doesn't matter how long the message has been on the
queue. If the host has been down for 5 days, new messages will be
bounced immediately.
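
To make the distinction concrete, here is a small Python sketch. The function name host_past_cutoff and the dates are invented for illustration, and it models only the point made above, not everything Exim does. Note that the age of the message is not an input at all.

```python
from datetime import datetime, timedelta


def host_past_cutoff(first_failed, now, final_cutoff):
    """True once the host has been failing for at least final_cutoff.

    Only the host's failure history matters here; how long any one
    message has been on the queue plays no part.
    """
    return now - first_failed >= final_cutoff


final_cutoff = timedelta(days=5)                   # from "F,5d,2h"
host_first_failed = datetime(2002, 7, 1, 0, 0, 0)  # made-up date

# A message that has waited 3 days for a host that has been down 3 days.
three_days_in = host_first_failed + timedelta(days=3)
print(host_past_cutoff(host_first_failed, three_days_in, final_cutoff))

# A brand-new message arriving after the host has been down 5 days 1 hour.
after_cutoff = host_first_failed + timedelta(days=5, hours=1)
print(host_past_cutoff(host_first_failed, after_cutoff, final_cutoff))
```

The first case is still retried; in the second case the new message is bounced even though it has spent no time on the queue.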

But yes, I agree that it should be using that retry rule. For a
double-check, you could try running

exim -brt mx.otherprovider example.com

> That does not explain why exim started bouncing messages after 38
> hours.

It depends on when Exim first detected that the host was down. But read
on...

In a previous message you posted:

Deliver: mx.example.com [(ip address)] error 110: Connection timed out
first failed: 27-Jul-2002 01:15:16
last tried: 29-Jul-2002 17:56:10
next try at: 29-Jul-2002 19:56:10

There are two facts we can deduce from that: (1) As the interval between
"last tried" and "next try at" is 2 hours, this must have been computed
from your first or second retry rule, because they are the only ones
with 2h in them. So most likely it WAS the expected rule. (2) The output
does not say "past final cutoff time", so Exim doesn't believe that the
host has been down long enough to bounce messages.
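
The arithmetic behind those two deductions can be checked with a few lines of Python. This is only a worked calculation on the three timestamps quoted above; the 5-day cutoff is taken from the relay_domains rule, on the assumption that it is the rule in use.

```python
from datetime import datetime, timedelta

FORMAT = "%d-%b-%Y %H:%M:%S"

first_failed = datetime.strptime("27-Jul-2002 01:15:16", FORMAT)
last_tried = datetime.strptime("29-Jul-2002 17:56:10", FORMAT)
next_try = datetime.strptime("29-Jul-2002 19:56:10", FORMAT)

# (1) The gap between the last attempt and the next one.
retry_interval = next_try - last_tried

# (2) How long the host had been failing at the last attempt.
down_at_last_try = last_tried - first_failed

final_cutoff = timedelta(days=5)  # assumed: the "F,5d,2h" part of the rule

print(retry_interval)                   # 2:00:00
print(down_at_last_try)                 # 2 days, 16:40:54
print(down_at_last_try < final_cutoff)  # True
```

So at the last attempt the host had been failing for just under 65 hours, well short of 5 days.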

> Sorry for being so stupid, but I am really confused.

You are not being stupid. This is a confusing area. I am also confused.

From that evidence, I cannot understand why it should bounce messages
after 38 hours. The retry information does not indicate that the host
retry time has expired, so it should not be bouncing.

The only thing I can now suggest for getting further information as to
what is going on is to send a test message, with debugging turned on so
we can see exactly what Exim is doing. Something like

exim -d9 xxx@???
.

I guess you could use an invalid xxx because we know it isn't actually
going to try a delivery, because it can't contact the host.

However, time has now passed, so maybe you can't run this kind of test
any more. One other thing could be done, and that is to grep out all
references to mx.example.com from your log files to see if anything can
be deduced from that information.
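
If plain grep is not convenient, something like the following Python sketch would do the same job. The function names are made up, and the log path in the usage comment is only a guess at where your logs live.

```python
def lines_mentioning(lines, host):
    """Return every line that contains the host name, without its newline."""
    return [line.rstrip("\n") for line in lines if host in line]


def search_logs(paths, host):
    """Print matching lines from each log file, prefixed by the file name."""
    for path in paths:
        with open(path, errors="replace") as handle:
            for line in lines_mentioning(handle, host):
                print(f"{path}: {line}")


# Usage (the path is an assumption; substitute your own log files):
# search_logs(["/var/log/exim/mainlog"], "mx.example.com")
```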

Philip

--
Philip Hazel            University of Cambridge Computing Service,
ph10@???      Cambridge, England. Phone: +44 1223 334714.
