Re: [exim] multi-stage fallback

Author: W B Hacker
Date:  
To: exim users
Subject: Re: [exim] multi-stage fallback
Ian P. Christian wrote:
> 2009/2/3 Nigel Metheringham <nigel.metheringham@???>:
>> By definition this box is only getting the least deliverable messages.
>>
>> Which would make me wonder about the idea of very frequent queue
>> runners (but feel free to show me this feeling is wrong).
>

Hmmm.. well, having actually come from that '4WPM' telegraph industry, I'd
submit that the retry intervals and timeouts that were appropriate in the
fidonet, BBS and UUCP era - a time when many 'networks' weren't really
networks, or were at least slower, less reliable, and not even always-up -
no longer fit. We might be better served to rethink whether we *should
even attempt* delivery for anywhere near as long as we once had to.

- Folks nowadays have come to rely on smtp for faster and more 'certain'
delivery than was expected. And it delivers that - well beyond
expectations, and cheaply so.

- But that leads us to 'trust' it for more time-sensitive traffic than
the traditional 3 or 4 day retry timeout actually serves.

- Most users in today's environment would prefer to 'be aware' of a
problem sooner. Much sooner.

After all, phone and fax are also cheaper than they were in fidonet
days, so if a message is of sufficient importance or time-sensitivity, a
failure DSN 'soonest' allows that sort of fallback. Or a manual re-send.

> You are correct - this server is full of mail to domains that are
> currently not accepting mail, hosts that impose greylisting, or any
> other reason for the mail not being immediately deliverable.
>


IF your traffic is 'clean' - i.e. not relayed spam etc. - very little of
what cannot be delivered by the primary will *ever* be deliverable by a
'fallback' outbound critter.

Vanishingly small - unless of course your service is being suborned
into spewing spam, acting as an open relay, supporting a dictionary
attack from infected boxen on your inside net - or some such rudeness.

Quit early, let the primary send a DSN back to your authenticated
submission client (and no others), and it is off the queue while they
seek to correct spelling, or get an email address for their
correspondent that actually works.

Getting such a DSN back to them from a fallback box to which they do not
attach and authenticate is tedious at best, and risks compounding the
problem at worst.

> This isn't a problem I can break down by domain, as we're talking
> about mail going from inside our network to outside.
>


ACK. I'd simply tune up the primary(ies) and shut the sucker down.

Where you want fallback/failover is on the inbound side so you don't
become one of those unreachable domains.

;-)

.. and/or a 'pool' of outbound servers, but peers - not a cascade.

> The idea of breaking down the problem by time was to allow for a
> fallback host to handle mail for the first 4 hours, where it might be
> being greylisted - allowing for the queue runners to quickly deal with
> such things, and not get bogged down with 10k's of older mail.
>


I've found greylisting (for all its negatives), to NOT be a significant
issue. It is *supposed to* only affect the first message, generally does
so, and thereafter goes essentially invisible to the sender.

I doubt it has any significant contribution to the balked deliveries on
your primary that now clog the fallback queue.

But 'undeliverable' is usually just that. It is not all that often that
the odds improve on day 'x' over the first few minutes (milliseconds,
even...).

Not even with majority third-world destinations.

> I'm welcome to suggestions that I'm potentially dealing with the issue
> incorrectly, I'm certainly not set on the idea of multi-stage
> fallbacks. I do remember this being demonstrated by Phil at a
> conference I went to in Cambridge though....
>


Specialty case - Exim can handle all manner of those. But we should not
always ask it to do 'edge' cases.

The traffic figures you cite sound an awful lot like an abused box or
user pool with compromised machines.

Question: Your fallback server. Are you certain that no submission can
be made to it *except* by your own primary? E.g. port 25 is not
listening to the outside, and/or it is bound to only an internal NIC and IP.
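
One quick way to answer that from the outside is simply to try
connecting. A minimal Python sketch, not a definitive test: the function
name `accepts_connections` and the host name in the comment are
hypothetical, and a successful TCP connect only shows that something is
listening there, not that it would actually relay mail.

```python
import socket


def accepts_connections(host, port=25, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    Refused connections, timeouts and name-resolution failures all
    count as "not reachable" and return False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError, socket.timeout and socket.gaierror
        # are all subclasses of OSError.
        return False


# Hypothetical usage, run from a machine OUTSIDE your network:
#   accepts_connections("fallback.example.net")
# You would hope to get False for the fallback box.
```

Run it from somewhere that is not your primary; a True result from an
arbitrary outside host means the fallback can be reached directly.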

>> You do want to ensure that messages have been routed, so that when a
>> delivery succeeds, another message can be attempted in the same session.
>
> Sorry, can you expand on what you mean here?


AFAIK, that implies that if the delivery is not taking place on the
'primary' box, subsequent messages still on the primary are not in the
same queue (yet), so they still sit. Further, any updated routability
info the fallback box gleans will not be shared. One could find a way to
share the caches, history / hints DB .. but that probably adds to the
wrong side of the complexity scorecard. 'KISS'

>
>> Tweaking of timeouts to avoid tarpits may be useful.
>
> Any suggestions here would be very welcome.
>
> Thanks for all those who have posted so far.
>


I suspect you'll do the most good by taking a fresh look at how your
primary is set up...

And analyzing the traffic sitting in the queue.

Hint:

SSH in, invoke a simple browser (lynx, links, or such). Point that
browser into the queue, wander about, and see what the headers and such
look like.

I'll bet a lot of it is garbage that should never have made it there.
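
If wandering the spool by hand gets tedious, the same look can be
roughed out from the output of `exim -bp`. The sketch below is Python and
assumes the usual listing layout (age, size, message id and `<sender>` on
the first line of each entry, then one recipient per line, with `D`
marking addresses already delivered). `summarize_queue` is a made-up
name, and the regular expression is a guess that may need adjusting for
your Exim version.

```python
import re
from collections import Counter

# First line of an entry: age, size, message id, <sender>,
# optionally followed by "*** frozen ***".
HEADER = re.compile(
    r"^\s*(\d+[mhd])\s+(\S+)\s+(\S+)\s+<(.*?)>(\s+\*\*\* frozen \*\*\*)?\s*$"
)


def summarize_queue(listing):
    """Count pending recipient domains and envelope senders.

    `listing` is the text printed by `exim -bp`.
    Returns (recipient_domains, senders) as two Counters.
    """
    recipient_domains = Counter()
    senders = Counter()
    for line in listing.splitlines():
        if not line.strip():
            continue  # blank line between entries
        m = HEADER.match(line)
        if m:
            # An empty sender means a bounce; show it as "<>".
            senders[m.group(4) or "<>"] += 1
            continue
        parts = line.split()
        if parts[0] == "D":
            continue  # already delivered, not part of the backlog
        addr = parts[-1]
        if "@" in addr:
            recipient_domains[addr.rsplit("@", 1)[1].lower()] += 1
    return recipient_domains, senders


# Hypothetical usage:
#   import subprocess
#   out = subprocess.run(["exim", "-bp"], capture_output=True, text=True).stdout
#   domains, senders = summarize_queue(out)
#   print(domains.most_common(20)); print(senders.most_common(20))
```

A handful of senders or recipient domains dominating the counts is a
good hint as to where the garbage is coming from. Exim also ships
`exiqsumm`, which summarizes `exim -bp` output by domain.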

Bill