Re: [exim] Greylisting algorithms after end of DATA

Author: Magnus Holmgren
Date:
To: exim-users
Subject: Re: [exim] Greylisting algorithms after end of DATA

On Saturday 13 January 2007 17:20, Peter Bowyer wrote:
> On 13/01/07, Magnus Holmgren <holmgren@???> wrote:
> > Traditional greylisting combines the remote host, envelope sender, and
> > envelope recipient and checks if that triplet has been seen before (not
> > too long ago but also at least some time ago) after each RCPT command.
> > (Correct me if I'm wrong.) The advantage is that it saves bandwidth.
>
> Saves resources. Bandwidth is one resource, but probably not the
> primary one in this context.

I won't argue with that, but again, it's a tradeoff. Indiscriminate
greylisting can be annoying at times. I didn't really intend to discuss the
pros and cons of greylisting before vs. after DATA, but rather how
greylisting after DATA can be optimised (mainly because that's the only way
SA-Exim can work).

> > Running SpamAssassin after end of DATA but before accepting the mail
> > gives the advantage that greylisting can be applied only to grey mail -
> > the delaying of clearly non-spam mail can be avoided.
>
> But for most people, running SA is the most expensive test they do,
> and they move it to last place in the chain for this reason.
> Greylisting is seen as a cheap way of turning away likely spam without
> having to go to the expense of content-scanning it. If SA is involved
> in the greylisting algorithm, the resource saving it delivers is
> significantly reduced. That is, unless the resulting improvement in
> the algorithm leads to better whitelisting and less SA work later.

I should point out that by "whitelisting" in this context I mean "not
subjecting to delay a second time", not that the sender is thought not to
send any spam at all.
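
To make that distinction concrete, here is a minimal sketch of the decision
taken at end of DATA. The thresholds and the function name are hypothetical
and are not taken from SA-Exim; the point is only that a host that has proven
it retries skips the delay, while its mail is still scored like anyone
else's.

```python
# Hypothetical thresholds; not taken from SA-Exim or any real configuration.
HAM_BELOW = 2.0    # scores below this are treated as clearly non-spam
REJECT_AT = 10.0   # scores at or above this are rejected outright


def decide(score: float, host_has_retried: bool) -> str:
    """Return 'accept', 'defer' or 'reject' for a message at end of DATA."""
    if score >= REJECT_AT:
        # "Whitelisted" hosts get no free pass for obvious spam.
        return "reject"
    if score < HAM_BELOW:
        # Clearly non-spam mail is never delayed.
        return "accept"
    # Grey mail: delay only if the host has not yet shown that it retries.
    return "accept" if host_has_retried else "defer"
```

With these assumed thresholds, a score of 5.0 is deferred from an unknown
host and accepted from one that has retried before, while a score of 12.0 is
rejected either way.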

> > It also means that e.g. the Message-ID
> > can be considered when determining whether we have seen the message
> > before.
>
> Does this have any correlation to whether the message is spam or not?
> If not, I'm not sure it helps....

Potentially. If a spammer/zombie does two spam runs with the same sender and
recipient but different contents, we might then notice that he did not, in
fact, retry. I don't know how likely this scenario is, though. In any
case, "does/does not retry" is largely a property of the host, i.e. the IP
address, not of the sender or recipient. I can see how there might be a point in
delaying mail from unknown senders at otherwise legitimate sites such as
Hotmail or Yahoo - the account might be terminated or URIs in the spam
blacklisted during that delay. But how long would the delay have to be to
have any significant effect?
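
The retry check described above could be sketched roughly as follows. This is
an illustration only, not how SA-Exim does it: the class, the in-memory
storage, the single-recipient key and the two time limits are all
assumptions. The key includes the Message-ID, so a second spam run with
different contents does not count as a retry, and a successful retry is
recorded against the host rather than the sender or recipient.

```python
import time
from typing import Callable, Dict, Set, Tuple

# Hypothetical limits, in seconds.
MIN_DELAY = 300        # a retry sooner than this does not count
MAX_AGE = 4 * 3600     # a pending entry older than this is started afresh

Key = Tuple[str, str, str, str]


class Greylist:
    """In-memory greylist keyed on host, sender, recipient and Message-ID."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self.clock = clock
        self.pending: Dict[Key, float] = {}   # key -> time first seen
        self.retrying_hosts: Set[str] = set()  # hosts known to retry

    def check(self, host: str, sender: str, recipient: str,
              message_id: str) -> str:
        """Return 'accept' or 'defer' for one message at end of DATA."""
        if host in self.retrying_hosts:
            return "accept"  # the host has already shown that it retries
        now = self.clock()
        key = (host, sender, recipient, message_id)
        first_seen = self.pending.get(key)
        if first_seen is None or now - first_seen > MAX_AGE:
            self.pending[key] = now  # first sighting, or the old one expired
            return "defer"
        if now - first_seen < MIN_DELAY:
            return "defer"  # came back too quickly to count as a retry
        del self.pending[key]
        self.retrying_hosts.add(host)
        return "accept"
```

A real implementation would need persistent storage and expiry of the host
list as well; both are left out here.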

> > In fact, nothing prevents us from using an arbitrary set of header fields
> > (such as Subject, Message-ID, From) in constructing the key, if it gives
> > better confidence in what we want to know: whether the other end retries
> > after a temporary failure. (We could even accept delivery and whitelist
> > based on a partial match, say 3 of 4, to better cope with the braindead
> > mail servers that unfortunately exist.) After we have determined that it
> > does, there's no reason to greylist further mail. (Well, there might be a
> > reason to delay mail from new senders at large ESPs like Hotmail, if that
> > means that URIs in the spam get the time to end up in URIBLs. This is
> > open to discussion.)
>
> Hmm. I can't see what aspect of traditional triplet-based greylisting
> you're improving with this. It seems to be the automatic whitelisting
> after a successful retry - but which aspects of the sender are you
> then able to whitelist more accurately as a result? Especially since
> the use of SA has added cost, you'd need to be clearly saving cost
> somehow.

To save resources, you could also apply traditional greylisting conditionally,
for example only to hosts listed in DNSBLs that you don't trust enough to
block on outright.
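
As a rough sketch of that condition, assuming IPv4 only: a DNSBL is queried
by reversing the octets of the address and appending the list's zone, and any
A record in the answer means "listed". The zone name "dnsbl.example" and the
function names are placeholders, and lookup failures are simply treated as
"not listed" here.

```python
import ipaddress
import socket
from typing import Callable, Iterable


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNSBL query name for an IPv4 address: reversed octets + zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """True if the DNSBL zone returns an A record for the address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        # No such name, or the lookup failed: treat as not listed.
        return False


def should_greylist(ip: str, zones: Iterable[str],
                    lookup: Callable[[str, str], bool] = is_listed) -> bool:
    """Greylist only hosts that appear in at least one low-trust DNSBL."""
    return any(lookup(ip, zone) for zone in zones)
```

For example, dnsbl_query_name("192.0.2.1", "dnsbl.example") gives
"1.2.0.192.dnsbl.example". Hosts for which should_greylist() is false would
skip greylisting entirely and go straight on to whatever checks come next.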

-- 
Magnus Holmgren        holmgren@???
                       (No Cc of list mail needed, thanks)

"Exim is better at being younger, whereas sendmail is better for
Scrabble (50 point bonus for clearing your rack)" -- Dave Evans
