Re: [Exim] [offtopic] anti-spam that actually works

Author: Mark Morley
Date:
To: nbecker
CC: exim-users
Subject: Re: [Exim] [offtopic] anti-spam that actually works
>
> My proposal is to build a site. It would set itself up to attract
> spam, but subscribing to lots of lists, etc. By keeping a hash (md5?)
> of messages, it would identify spam by the large number of identical
> hashes.
>


I did exactly this earlier in the year as an experiment. I created
several phony email addresses and hid them all over various web pages
in HTML comments. Sure enough they soon started getting flooded with
spam.

Then I started processing each message automatically and generating an
MD5 string for them. I only used the body of the message - the headers
were not included in the calculation because they change too much. I
also converted all letters in the body to lower case and ignored any
character that wasn't a letter or a digit.
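
As a rough illustration of that normalization, here is a minimal Python
sketch. It is not the code used in the experiment; the function name
body_md5 is made up for the example, and "letter or digit" is assumed to
mean ASCII a-z and 0-9.

```python
import hashlib
import re


def body_md5(body: str) -> str:
    """Hash a message body after the normalization described above.

    Assumption: "letter or digit" means ASCII a-z and 0-9 only.
    """
    # Fold case first, then drop every character that is not a letter or digit.
    normalized = re.sub(r"[^a-z0-9]", "", body.lower())
    return hashlib.md5(normalized.encode("ascii")).hexdigest()
```

With this, bodies that differ only in case, whitespace, or punctuation
hash to the same value, e.g. body_md5("Buy NOW!!!") equals
body_md5("buy now").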

The resulting MD5 string was placed automatically into a DBM database.

Our users have a custom filtering language available to them. To
participate in the experiment all they had to do was add one rule:

delete if md5 is in md5-database

That caused the md5 string to be calculated for each incoming message
and looked up in a globally accessible database.
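
What that rule amounts to can be sketched in Python with the standard
dbm module. This is only an approximation of the idea, not the actual
filter implementation: the function names record_spam and is_known_spam,
the stored value b"1", and the db_path argument are all assumptions made
for the example.

```python
import dbm
import hashlib
import re


def _body_md5(body: str) -> bytes:
    # Same normalization as described earlier: lower case, letters and digits only.
    normalized = re.sub(r"[^a-z0-9]", "", body.lower())
    return hashlib.md5(normalized.encode("ascii")).hexdigest().encode("ascii")


def record_spam(db_path: str, body: str) -> None:
    """Called for mail arriving at a spam-trap address: store its hash."""
    with dbm.open(db_path, "c") as db:  # "c" creates the database if missing
        db[_body_md5(body)] = b"1"      # only the key matters; the value is a placeholder


def is_known_spam(db_path: str, body: str) -> bool:
    """Called for a user's incoming mail: True if its hash is already recorded."""
    with dbm.open(db_path, "c") as db:
        return _body_md5(body) in db
```

A delivery agent would then discard the message when
is_known_spam(path, body) returns True, which is what the one-line rule
asks for.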

It was fairly successful in the cases where a spam happened to be
delivered to one of the spam-trap addresses first. But this didn't
happen often. That could be improved by setting up a lot more spam
traps, possibly at multiple providers and domains.

But spammers aren't totally stupid. A lot of spam I see now has a
unique "number" within the first or last couple of lines of the
body. This totally throws off the MD5 hash.

A solution that seemed to work for most of them was to exclude the
first and last few lines of the body when calculating the hash.
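
A minimal Python sketch of that trimming step follows. The name
trimmed_md5 and the default of 3 skipped lines are assumptions ("a few"
is not pinned down above), as is the choice to hash the whole body when
the message is too short to trim.

```python
import hashlib
import re


def trimmed_md5(body: str, skip: int = 3) -> str:
    """Hash the body while ignoring the first and last `skip` lines.

    Assumption: messages with 2 * skip lines or fewer are hashed whole,
    since trimming them would leave nothing to hash.
    """
    lines = body.splitlines()
    if len(lines) > 2 * skip:
        lines = lines[skip:len(lines) - skip]
    # Same normalization as before: lower case, ASCII letters and digits only.
    normalized = re.sub(r"[^a-z0-9]", "", "\n".join(lines).lower())
    return hashlib.md5(normalized.encode("ascii")).hexdigest()
```

Two copies of the same spam that differ only in a tracking number near
the top or bottom then produce the same hash.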

Another idea I had but didn't implement was to check each "word" in
the body against a dictionary and only include valid words in the
MD5 calculation. This isn't nearly as CPU intensive as it may sound.
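
Since this idea was never implemented, the following Python sketch is
purely hypothetical. It assumes the dictionary is held in memory as a
set of lower-case words, that a "word" is a run of ASCII letters, and
that the kept words are joined with single spaces before hashing; the
name dictionary_md5 is made up for the example.

```python
import hashlib
import re


def dictionary_md5(body: str, dictionary: set) -> str:
    """Hash only the words of the body that appear in `dictionary`.

    Assumptions: `dictionary` holds lower-case words, and a "word" is a
    run of ASCII letters.
    """
    words = re.findall(r"[a-z]+", body.lower())
    # Set membership is a hash lookup, so the cost is roughly one lookup per word.
    kept = [w for w in words if w in dictionary]
    return hashlib.md5(" ".join(kept).encode("ascii")).hexdigest()
```

Random tokens such as "xq7391z" contain no dictionary words, so they
drop out of the calculation wherever they appear in the body.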

Mark