On Thu, 1 Aug 2002, Marc MERLIN wrote:
> So, I have mail for a user that was bounced by exim
>
> 2002-08-01 05:17:40 17aEkW-0000od-00 mx9.sun.com [192.18.98.34]: Connection timed out
> 2002-08-01 05:20:49 17aEkW-0000od-00 mx8.sun.com [192.18.98.36]: Connection timed out
> 2002-08-01 05:23:58 17aEkW-0000od-00 mx2.sun.com [192.18.98.43]: Connection timed out
Notice the IP addresses: 192.18.98.{34,36,43}.
> 2002-08-01 05:23:58 17aEkW-0000od-00 == marco.walther@??? R=lookuphost T=remote_smtp defer (110): Connection timed out
> 2002-08-01 05:23:58 17aEkW-0000od-00 ** marco.walther@???: retry timeout exceeded
>
> Ok, we've all seen this, but mail to him was working less than 24H previous
> to that:
> 2002-07-31 17:33:18 17a3tx-0006Vt-01 => marco.walther@??? F=<svlug-bounces+marco.walther=sun.com@???> R=lookuphost T=remote_smtp S=5049 H=mx1.sun.com [192.18.98.31] C="250 SAA17437 Message accepted for delivery"
Notice that the IP address there is 192.18.98.31. That is different to
the three above.
> So, I'm trying to find out why exim gave up delivery to him before the 4
> days expired _and_ a delivery worked soon before that?
I think I now understand this. I also think it's an obscure bug in Exim,
so I have made a note to try to find a way of improving things.
The clue is the hosts_max_try option in the smtp transport, whose
default is 5. The sun.com domain currently resolves to 7 hosts:
mx6.sun.com. A 192.18.42.13
mx8.sun.com. A 192.18.98.36
mx9.sun.com. A 192.18.98.34
mx7.sun.com. A 192.18.100.1
mx1.sun.com. A 192.18.98.31
mx2.sun.com. A 192.18.98.43
mx5.sun.com. A 192.18.42.14
[This shows the usefulness of posting real log data. If you had obscured
the domain, I would never have thought of this.]
So Exim would have picked 5 to try. If all 5 failed, it would have
looked at their retry times, and if they were all expired, it would have
bounced the message, ignoring the other two hosts.
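To make that sequence concrete, here is a minimal sketch in Python of the behaviour just described. It is a simplified model, not Exim's actual code. The function name, the host ordering, and the assumption that only mx1 accepts connections are all hypothetical, chosen to match what the logs suggest.

```python
def attempt_delivery(hosts, up, retry_expired, hosts_max_try=5):
    """Simplified model of one delivery attempt (not Exim's real code).

    hosts          -- host names in the order they would be tried
    up             -- set of hosts that accept connections
    retry_expired  -- set of hosts whose retry timeout has been exceeded
    """
    # Hosts beyond the limit are never looked at in this model.
    candidates = hosts[:hosts_max_try]
    for host in candidates:
        if host in up:
            return ("delivered", host)
    # Every candidate failed: bounce only if all of them are past their
    # retry timeout, otherwise keep the message for a later attempt.
    if all(host in retry_expired for host in candidates):
        return ("bounced", None)
    return ("deferred", None)


# Hypothetical ordering in which mx1 falls outside the first five hosts.
order = ["mx9", "mx8", "mx2", "mx6", "mx7", "mx1", "mx5"]
up = {"mx1"}  # assumption: only mx1 is accepting connections
expired = {"mx9", "mx8", "mx2", "mx6", "mx7"}

print(attempt_delivery(order, up, expired))                   # limit of 5
print(attempt_delivery(order, up, expired, hosts_max_try=7))  # limit of 7
```

With the limit at 5 the model bounces the message even though a working host exists further down the list. With the limit at 7 it reaches mx1 and delivers.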
> You'll see that there are some connection timeouts by some MXes, but MX1
> also returned a lot more C="250 GAA24313 Message accepted for delivery"
>
> Could it be that the failure cache is by MX, and that when the delivery
> failed, an MX lookup didn't return MX1, and all the other MXes failed (they
> apparently always do)?
Effectively, yes! It returned MX1, but Exim discarded it by virtue of
the hosts_max_try setting.
<grumble>
What is the point of putting 7 MXs into the DNS if most of them always
reject connections?
</grumble>
Your workaround, of course, is to set hosts_max_try to some larger
number. My task is to make Exim look at the retry time for *all* the
hosts before bouncing an address.
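As an illustration of the workaround, the option goes on the smtp transport in the Exim configuration file. The transport name remote_smtp is taken from the T= field in the logs above. The value 10 is only an example, and any number at least as large as the count of MX hosts should have the same effect for this domain.

```
remote_smtp:
  driver = smtp
  # Example value only: large enough to cover all 7 sun.com MX hosts.
  hosts_max_try = 10
```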
Thanks for persisting on this one.
Philip
--
Philip Hazel University of Cambridge Computing Service,
ph10@??? Cambridge, England. Phone: +44 1223 334714.