Re: [Exim] error response rate limiting vs. overloaded systems

Author: Exim Users Mailing List
Date:  
To: Avleen Vig
CC: Exim Users Mailing List
Subject: Re: [Exim] error response rate limiting vs. overloaded systems
[ On Saturday, July 5, 2003 at 22:03:21 (-0700), Avleen Vig wrote: ]
> Subject: Re: [Exim] error response rate limiting vs. overloaded systems
>
> Let me at least make sure we're on the same page:
> We're talking about rate limiting responses to clients who would receive
> an error. This conversation started talking about "bad" clients, but
> seems to have gone on to talk about all errors.


Yes, that's it.

> The chief culprits by far, of these errors, are spammers. Spammers cause
> more of these errors through "user unknown", "relay denied", "block
> because of XYZ BL", etc.


Yes, but that they are spammers is irrelevant....

> I refer you now to RFC 1925. Humorous as its release date is, point 9
> is very relevant here:
> (9) For all resources, whatever it is, you need more.


Also essentially irrelevant (except for its humour content :-)

> I speak from the viewpoint of ISPs, because this is where most of my
> experience is based.
> When a spammer connects to your MX, he does not open one connection,
> pipeline as many messages as he can, close it, and then open another
> connection.


Perhaps, but also irrelevant.

The point here is to make sure that any client you've sent an error
response to can't immediately re-connect and do the same thing again.
It really doesn't matter why you've rejected the transaction -- it could
be that the recipient is unknown, or it could be because you've used
some blacklist to reject UCE, or whatever. You've already accepted the
connection. You've already got a process running to handle it. Now all
you do is make that process go idle for what, in system terms, is a very
long period of time; during that time you let your system re-use
whatever of that process' resources it knows how to re-use, until it's
finally time to send the last line and wait for the client to
disconnect. The added cost of doing this is negligible.
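For what it's worth, the mechanism is literally just a sleep between the
continuation line and the final line of the reply. A minimal sketch in
Python (all names here are my own invention, not Exim code):

```python
import socket
import threading
import time

ERROR_DELAY = 60  # seconds to hold the connection open; tune to taste

def reject_slowly(conn, code, text, delay=ERROR_DELAY):
    """Send an SMTP error as a two-line multiline reply, holding back
    the final line for `delay` seconds.  A well-behaved client simply
    waits; a broken client that hammers on re-connects is held off for
    the whole delay, while the server-side process sits idle and cheap."""
    # "XXX-text" is the SMTP continuation form: the client must keep reading.
    conn.sendall(f"{code}-{text}\r\n".encode("ascii"))
    time.sleep(delay)  # the long idle period -- negligible cost to the system
    # "XXX text" (space, not dash) terminates the reply and frees the client.
    conn.sendall(f"{code} {text}\r\n".encode("ascii"))
```

The continuation/termination distinction matters: a compliant SMTP
client cannot act on the reply until it has seen the final line.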

> Spammers appreciate that most ISP's, and even a number of businesses,
> now employ the use of connection rate limiting (in one form or another).
> He finds limits at which he can slam your servers from one of many
> drones around the world until either his list of recipients is
> exhausted, or your servers block him. He will open up as many
> connections as possible.


While this may be true of some spammers it is not generally true, and it
is especially not true of broken client software, which has been the
most noticeable cause of problems in my experience with both ISPs and
corporate networks.

Most spamware (i.e. the stuff that exploits open relays and proxies, or
which routes to backup MXs, etc.) isn't really that smart -- and most
direct spammers are just using qmail, postfix, exim, or sendmail. Some
of the best spamware might know to back off when it gets 421's so that
it doesn't get tied up with a given server, but that's probably the best
of it.

> So, what happens when you add 25% more capacity? Yup, you're right back
> where you started.


Nope, you've failed to note that adding capacity doesn't mean you also
hand control of it over to third parties. The goal of DoS protection
mechanisms is to keep control over your resources and to not give that
control to third parties. Like I said: pay a little, save a lot.

> Spammers, through the mass availability of open proxies and relays,
> compromised clients, and other things we wish didn't exist, have far
> more resources to send our mail, than any one organisation does to
> receive (to the best of my knowledge).


Sure, but that's more or less irrelevant. This part of the problem is
solved by identification and denial of authorisation -- i.e. reject
their transactions before they can further impact your limited
resources. Open proxies and open relays are relatively easily
mechanically identified in a completely impartial manner and there are
several well maintained lists of them. The important thing here is to
realize that when rejecting their connections you really must employ
error response rate limiting. This is because many open relays, and
most open proxies, are prone to exactly the kind of problem that we
started down this thread with -- they will unwittingly hammer on your
server if you send them a 5xx response that they don't honour. This
only stands to reason because any server that's an open relay is liable
to have other implementation bugs as well.
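Mechanically identifying a client against one of those lists is a
straightforward DNS lookup: reverse the IPv4 octets and append the list
zone. A sketch (the zone name is only an example, and the resolver is
injected as a callable so the sketch runs offline -- none of this is
Exim's actual implementation):

```python
import ipaddress

def dnsbl_query_name(client_ip, zone="bl.example.org"):
    """Build the DNSBL lookup name for an IPv4 client:
    octets reversed, then the list zone appended."""
    octets = ipaddress.IPv4Address(client_ip).exploded.split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(client_ip, resolve, zone="bl.example.org"):
    """`resolve` is a gethostbyname-style callable, injected for
    testability.  A listed IP conventionally resolves to a 127.0.0.x
    answer; NXDOMAIN (raised as OSError here) means not listed."""
    try:
        answer = resolve(dnsbl_query_name(client_ip, zone))
    except OSError:
        return False
    return answer.startswith("127.")
```

The point of the surrounding argument still stands, though: whatever
mechanism produces the rejection, the 5xx reply itself must be rate
limited, because these are exactly the clients most likely to ignore it.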

> So the obvious question on the minds of some readers will be "Well, why
> not set Exim up to accept 20 connections? Then you can last twice as
> long!".


If you've actually implemented error response rate limiting fully and
properly, as I have, then you'll soon realize that this is almost the
right solution -- though perhaps not quite the right number; maybe 12 or
15 would be right, given your hypothetical scenario and assuming you
have implemented some ACLs which might actually trigger such rate
limiting.

> The answer quite simply, is that most admins will set their servers up
> to accept the maximum number of connections they can.
> After that, they simply cannot accept more connections.


I think you need to go learn more about tuning multi-user systems and
servers.

First off, that's really not how most admins tune their servers -- at
least none who are experienced at tuning servers will do this. The
problem becomes quite apparent as soon as you've encountered critical
failures caused by making such a mistake.

Secondly, if you've paid any attention at all to how processes holding
idle connections behave in a busy system, you'll realize that the number
of idle connections you can manage is orders of magnitude higher than
the number of non-idle connections you can manage.
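To illustrate why parked connections are so cheap, here is a toy
event-driven version of the same idea (my own sketch, not how Exim is
structured): each delayed rejection is just a coroutine suspended in a
sleep, and an event loop can hold thousands of those while doing no work
at all.

```python
import asyncio

ERROR_DELAY = 60  # seconds; shrink for testing

async def handle(reader, writer):
    """One rejected transaction per connection -- a deliberately tiny
    sketch; a real server parses commands and only delays on errors."""
    await reader.readline()            # read the client's command
    writer.write(b"550-rejected\r\n")  # continuation line goes out now
    await writer.drain()
    await asyncio.sleep(ERROR_DELAY)   # idle: one parked coroutine, no busy process
    writer.write(b"550 rejected\r\n")  # final line releases the client
    await writer.drain()
    writer.close()
    await writer.wait_closed()
```

While a coroutine sleeps it consumes no CPU and only a small, fixed
amount of memory -- which is the "orders of magnitude" difference
between idle and non-idle connections in practice.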

> > Have you got actual stats for any such machine or group of machines?
>
> What specifically are you looking for? I have many stats :-)


Well, we need to start by identifying the ratio of normal connections
vs. those which trigger some kind of error response. Then we need to
look at how many of those errors appear to have been ignored by the
client -- indicated by a re-connect attempt within the same period
when an error response rate limiting delay would have prevented any
normal SMTP client from re-connecting.
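A rough way to pull that second number out of a connection log might
look like this (the tuple layout is my own assumption for the sketch,
not Exim's actual log format):

```python
def reconnect_offenders(events, window):
    """events: iterable of (timestamp, client_ip, was_error) tuples,
    sorted by timestamp.  Returns the client IPs that re-connected
    within `window` seconds of receiving an error response -- i.e.
    exactly the clients a rate-limiting delay would have held off."""
    last_error = {}   # client_ip -> timestamp of its most recent error
    offenders = set()
    for ts, ip, was_error in events:
        prev = last_error.get(ip)
        if prev is not None and ts - prev < window:
            offenders.add(ip)  # came back before the delay would have expired
        if was_error:
            last_error[ip] = ts
    return offenders
```

Dividing the offender count by the total error count gives a first
estimate of how much hammering the delay would absorb.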

I repeat: if your 300 conn/s machine is rejecting any significant
number of the transactions attempted in those connections (say 30% or
so) then by implementing full error response rate limiting you will
probably end up with lots of surplus server capacity, and you certainly
won't have to add any more RAM to handle the result. The more
transactions you reject, slowly, the more you save. The fact that
you've got 300 connections per second just makes it more likely, not
less likely, that this will be the result. My one-connection-per-second
example machine is actually less likely to encounter problematic
clients than your monster server is.

The best part though is that since I got all the bugs worked out of my
error response rate limiting I've not had to firewall any misbehaving
clients that were effectively perpetrating denial of service attacks in
the past. Error response rate limiting really is a cost effective way
to protect against DoS, be it caused by broken software or spammers.

--
                                Greg A. Woods


+1 416 218-0098;            <g.a.woods@???>;           <woods@???>
Planix, Inc. <woods@???>; VE3TCP; Secrets of the Weird <woods@???>