Re: [exim] multi-stage fallback

Author: Richard Pitt
Date:  
To: exim users
Subject: Re: [exim] multi-stage fallback
I've set up a multi-stage fallback that is also multi-pool for one of my
customers. Note that the customer is a legitimate e-mail marketing
provider. They host thousands of "internal" lists for franchise
companies as well as a number of other double opt-in lists, so they're
not quite the same as an ISP.

Our queue runners start at 1 minute intervals. I have a daemon that
monitors the queue length and starts additional queue runners if it
gets beyond 1000 or so.
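
A rough sketch of what such a queue monitor could look like in Python.
It is not the daemon described above: the cap of 8 extra runners, the
function names and the binary name "exim" are assumptions made for
illustration. It relies on "exim -bpc" printing the number of messages
in the queue and "exim -q" starting a single queue run.

```python
import subprocess
import time

EXIM = "exim"             # assumption: binary name/path differs per system
QUEUE_THRESHOLD = 1000    # from the post: "beyond 1000 or so"
MAX_EXTRA_RUNNERS = 8     # assumption: arbitrary safety cap
POLL_SECONDS = 60         # matches the 1 minute queue runner interval


def extra_runners_needed(queue_len, threshold=QUEUE_THRESHOLD,
                         max_extra=MAX_EXTRA_RUNNERS):
    """Return how many extra queue runners to keep alive.

    Nothing extra at or below the threshold; above it, one runner per
    full `threshold` messages, capped at `max_extra`.
    """
    if queue_len <= threshold:
        return 0
    return min(queue_len // threshold, max_extra)


def queue_length():
    """Ask Exim for the number of messages currently in the queue."""
    result = subprocess.run([EXIM, "-bpc"], capture_output=True,
                            text=True, check=True)
    return int(result.stdout.strip())


def main():
    """Poll the queue and top up the number of extra queue runners."""
    running = []
    while True:
        # Forget runners that have finished their queue run.
        running = [p for p in running if p.poll() is None]
        wanted = extra_runners_needed(queue_length())
        for _ in range(max(0, wanted - len(running))):
            running.append(subprocess.Popen([EXIM, "-q"]))
        time.sleep(POLL_SECONDS)
```

Calling main() as root (or as the exim user) starts the loop. Runners
launched this way are in addition to whatever the regular exim daemon
starts on its own schedule.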

Another daemon watches RAM use and does not allow the system to swap,
no matter what. Rather than letting the kernel choose which processes
to kill, the daemon does the killing itself, and only kills exim
processes.
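
A minimal sketch of the idea behind such a memory watchdog, assuming a
Linux box with /proc/meminfo and a procps-style "ps". The 256 MB floor,
the "kill the exim process with the largest RSS" policy and all function
names are assumptions, not a description of the actual daemon.

```python
import os
import signal
import subprocess

MIN_AVAILABLE_KB = 256 * 1024   # assumption: act below 256 MB available
PROCESS_NAME = "exim"           # assumption: e.g. "exim4" on some systems


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def memory_low(info, min_available_kb=MIN_AVAILABLE_KB):
    """True when available memory has dropped below the floor."""
    # Older kernels have no MemAvailable; fall back to MemFree + Cached.
    fallback = info.get("MemFree", 0) + info.get("Cached", 0)
    return info.get("MemAvailable", fallback) < min_available_kb


def pick_victim(procs):
    """From (pid, rss_kb) pairs, return the pid using most RAM, or None."""
    if not procs:
        return None
    return max(procs, key=lambda p: p[1])[0]


def exim_processes():
    """List (pid, rss_kb) for processes whose command name matches."""
    result = subprocess.run(["ps", "-C", PROCESS_NAME, "-o", "pid=,rss="],
                            capture_output=True, text=True)
    return [tuple(int(x) for x in line.split())
            for line in result.stdout.splitlines() if line.strip()]


def check_once():
    """Kill one exim process if memory is low; return its pid or None."""
    with open("/proc/meminfo") as fh:
        info = parse_meminfo(fh.read())
    if not memory_low(info):
        return None
    victim = pick_victim(exim_processes())
    if victim is not None:
        os.kill(victim, signal.SIGTERM)
    return victim
```

check_once() would be called from a tight loop or a timer. SIGTERM is
used here so that the killed process exits cleanly and its message
simply waits for a later queue run.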

The machines have no interactive component - they are 100% exim - so
loads can and do grow beyond 100. I keep a "nice --20" shell open on
them to do maintenance.

Failover comes after a first set of routers. These push Yahoo and other
domains that use DKIM to a machine that is set up for it, and hotmail's
various aliases (msn, etc.) to another machine that handles "problem"
domains that have been identified. We'll eventually make all the
machines do DKIM but will probably keep the separate one for Yahoo, as
they sometimes go into "slow" mode and only accept a few messages per
minute.
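
The routing decision above can be modelled in a few lines of Python.
This is only an illustration of the logic: in Exim itself it would be
expressed as routers with domain lists (for example manualroute), and
the host names and the exact domain list below are hypothetical.

```python
# Hypothetical relay hosts, not the real machine names.
DKIM_HOST = "dkim-out.example.net"        # Yahoo and other DKIM domains
PROBLEM_HOST = "problem-out.example.net"  # hotmail aliases, "problem" domains

# Assumption: a small sample of domains; the real lists are longer.
RELAY_BY_DOMAIN = {
    "yahoo.com": DKIM_HOST,
    "hotmail.com": PROBLEM_HOST,
    "msn.com": PROBLEM_HOST,
}


def pick_relay(address):
    """Return the special relay for a recipient, or None.

    None means no special pool applies: deliver as normal, with the
    failover-only machine as the fallback.
    """
    domain = address.rsplit("@", 1)[-1].strip().lower()
    return RELAY_BY_DOMAIN.get(domain)
```

For example, pick_relay("someone@MSN.com") selects the "problem" host,
while an address at any unlisted domain returns None and takes the
normal delivery path.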

The rest are delivered as normal with a failover to a machine that
handles only failovers. Over 85% of the mail is handled by the first
machine, the yahoo machine and the hotmail machine. The rest ends up on
the failover machine where we have a 1 day timeout. I expect an ISP
would extend this but on my own E-mail service I have this set to a max
of 4 days, not the traditional 7.

richard

On Wed, 2009-02-04 at 01:35 +0800, W B Hacker wrote:
> Ian P. Christian wrote:
> > 2009/2/3 Nigel Metheringham <nigel.metheringham@???>:
> >> By definition this box is only getting the least deliverable messages.
> >>
> >> Which would make me wonder about the idea of very frequent queue
> >> runners (but feel free to show me this feeling is wrong).
> >
> Hmmm.. well having actually come from that '4WPM' telegraph industry, I'd
> submit that retry and timeouts that were appropriate in the fidonet and
> BBS and UUCP era - a time where more networks weren't (networks) or at
> least slower, less reliable, and not even always-up, we might be better
> served to rethink whether we *should even attempt* delivery for anywhere
> near as long as we once had to do.
>
> - Folks nowadays have come to rely on smtp for faster and more 'certain'
> delivery than was expected. And it delivers that - well beyond
> expectations, and cheaply so.
>
> - But that leads us to 'trust' it for more time-sensitive traffic than
> traditional 3 or 4 day retry timeout actually serves.
>
> - Most users in today's environment would prefer to 'be aware' of a
> problem sooner. Much sooner.
>
> After all, phone and fax are also cheaper than they were in fidonet
> days, so if a message is of sufficient importance or time-sensitivity, a
> failure DSN 'soonest' allows that sort of fallback. Or a manual re-send.
>
> > You are correct - this server is full of mail to domains that are
> > currently not accepting mail, hosts that impose greylisting, or any
> > other reason for the mail not being immediately deliverable.
> >
>
> IF your traffic is 'clean' - i.e. not relayed spam etc, there should be
> very little of what cannot be delivered that will *ever* be deliverable
> by a 'fallback' outbound critter.
>
> Vanishingly small - unless of course your service is being suborned
> into spewing spam, acting as an open relay, supporting a dictionary
> attack from infected boxen on your inside net - or some such rudeness.
>
> Quit early, let the primary send a DSN back to your authenticated
> submission client (and no others), and it is off the queue while they
> seek to correct spelling, or get an email address for their
> correspondent that actually works.
>
> Getting such a DSN back to them from a fallback box to which they do not
> attach and authenticate is tedious at best, risks compounding the
> problem at worst.
>
> > This isn't a problem I can break down by domain, as we're talking
> > about mail going from inside our network to outside.
> >
>
> ACK. I'd simply tune up the primary(ies) and shut the sucker down.
>
> Where you want fallback/failover is on the inbound side so you don't
> become one of those unreachable domains.
>
> ;-)
>
> .. and/or a 'pool' of outbound servers, but peers - not a cascade.
>
> > The idea of breaking down the problem by time was to allow for a
> > fallback host to handle mail for the first 4 hours, where it might be
> > being greylisted - allowing for the queue runners to quickly deal with
> > such things, and not get bogged down with 10k's of older mail.
> >
>
> I've found greylisting (for all its negatives), to NOT be a significant
> issue. It is *supposed to* only affect the first message, generally does
> so, and thereafter goes essentially invisible to the sender.
>
> I doubt it has any significant contribution to the balked deliveries on
> your primary that now clog the fallback queue.
>
> But 'undeliverable' is usually just that. It is not all that often it
> improves day 'x' over first-few-minute (milliseconds, even...).
>
> Not even with majority third-world destinations.
>
> > I'm open to suggestions that I'm potentially dealing with the issue
> > incorrectly, I'm certainly not set on the idea of multi-stage
> > fallbacks. I do remember this being demonstrated by Phil at a
> > conference I went to in Cambridge though....
> >
>
> Specialty case - Exim can handle all manner of those. But we should not
> always ask it to do 'edge' cases.
>
> The traffic figures you cite sound an awful lot like an abused box or
> user pool with compromised machines.
>
> Question: Your fallback server. Are you certain that no submission can
> be made to it *except* by your own primary? E.g. - port 25 is not
> listening, and/or it is bound to only an internal NIC and IP.
>
> >> You do want to ensure that messages have been routed, so that when a
> >> delivery succeeds, another message can be attempted in the same session.
> >
> > Sorry, can you expand on what you mean here?
>
> AFAIK, that could imply that if not taking place on the 'primary' box,
> subsequent messages still on the primary are not in the same queue (yet)
> so still sit. Further, any updated routability info the fallback box
> gleans will not be shared. One could find a way to share the caches,
> history / hints DB .. but that probably adds to the wrong side of the
> complexity scorecard. 'KISS'
>
> >
> >> Tweaking of timeouts to avoid tarpits may be useful.
> >
> > Any suggestions here would be very welcome.
> >
> > Thanks for all those who have posted so far.
> >
>
> I suspect you'll do the most good by taking a fresh look at how your
> primary is set up...
>
> And analyzing the traffic sitting in the queue.
>
> Hint:
>
> SSH in, invoke a simple browser (lynx, links, or such). Point that
> browser into the queue, wander about, and see what the headers and such
> look like.
>
> I'll bet a lot of it is garbage that should never have made it there.
>
> Bill
>
>

-- 
Richard C. Pitt                 Pacific Data Capture
rcpitt@???               604-644-9265
http://blog.pacdat.net       www.pacdat.net
PGP Fingerprint: FCEF 167D 151B 64C4 3333  57F0 4F18 AF98 9F59 DD73