Author: Mark Morley Date: To: nbecker CC: exim-users Subject: Re: [Exim] [offtopic] anti-spam that actually works
> > My proposal is to build a site. It would set itself up to attract
> spam, by subscribing to lots of lists, etc. By keeping a hash (md5?)
> of messages, it would identify spam by the large number of identical
> hashes.
>
I did exactly this earlier in the year as an experiment. I created
several phony email addresses and hid them all over various web pages
in HTML comments. Sure enough they soon started getting flooded with
spam.
Then I started processing each message automatically and generating an
MD5 string for them. I only used the body of the message - the headers
were not included in the calculation because they change too much. I
also converted all letters in the body to lower case and ignored any
character that wasn't a letter or a digit.
The resulting MD5 string was placed automatically into a DBM database.
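For anyone who wants to try this, the normalization and hashing step can be sketched roughly as follows. This is not the code I ran; the function name body_fingerprint is made up for illustration, and treating "letter or digit" as ASCII-only is an assumption.

```python
import hashlib


def body_fingerprint(body: str) -> str:
    """Return an MD5 hex digest of a message body (headers already removed).

    The body is lowercased and every character that is not an ASCII
    letter or digit is dropped before hashing, so changes in case,
    whitespace, or punctuation do not alter the result.
    """
    normalized = "".join(
        ch for ch in body.lower() if ch.isascii() and ch.isalnum()
    )
    return hashlib.md5(normalized.encode("ascii")).hexdigest()
```

With this, "Buy NOW!!!" and "buy now" produce the same string, while any change to the letters or digits themselves produces a different one.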
Our users have a custom filtering language available to them. To
participate in the experiment all they had to do was add one rule:
    delete if md5 is in md5-database
That caused the md5 string to be calculated for each incoming message
and looked up in a globally accessible database.
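A minimal sketch of the store-and-lookup side, assuming Python's standard dbm module stands in for the DBM database. The names open_database, record_spam and should_delete are hypothetical, and our filtering language itself is not shown. The functions accept any mapping with bytes keys, so a plain dict also works for experimenting.

```python
import dbm
import hashlib


def body_fingerprint(body: str) -> str:
    # Lowercase, keep only ASCII letters and digits, then hash.
    normalized = "".join(
        ch for ch in body.lower() if ch.isascii() and ch.isalnum()
    )
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def open_database(path: str):
    # "c" opens the database read-write and creates it if it is missing.
    return dbm.open(path, "c")


def record_spam(db, body: str) -> None:
    # Called for each message that arrives at a spam-trap address.
    db[body_fingerprint(body).encode("ascii")] = b"1"


def should_delete(db, body: str) -> bool:
    # Equivalent of the rule "delete if md5 is in md5-database".
    return body_fingerprint(body).encode("ascii") in db
```

The spam-trap side would call record_spam on a database returned by open_database("md5-database"), and the delivery side would call should_delete on each incoming body.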
It was fairly successful in the cases where a spam happened to be
delivered to one of the spam-trap addresses first. But this didn't
happen often. That could be improved by setting up a lot more spam
traps, possibly at multiple providers and domains.
But spammers aren't totally stupid. A lot of spam I see now has a
unique "number" within the first or last couple of lines of the
body. This totally throws off the MD5 hash.
A solution that seemed to work for most of them was to exclude the
first and last few lines of the body when calculating the hash.
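Roughly, that trimming can be sketched like this. The function name trimmed_fingerprint and the default of three lines are assumptions for illustration, as is the choice to hash the whole body when the message is too short to trim.

```python
import hashlib


def trimmed_fingerprint(body: str, skip: int = 3) -> str:
    """MD5 of the body with the first and last `skip` lines left out."""
    lines = body.splitlines()
    # Only trim when something would remain; otherwise hash everything.
    if len(lines) > 2 * skip:
        lines = lines[skip:len(lines) - skip]
    text = "".join(lines).lower()
    # Same normalization as before: keep ASCII letters and digits only.
    normalized = "".join(ch for ch in text if ch.isascii() and ch.isalnum())
    return hashlib.md5(normalized.encode("ascii")).hexdigest()
```

Two copies of a message that differ only in a tracking number on the first or last line then hash to the same value. The obvious cost is that very short messages get no protection from the trimming at all.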
Another idea I had but didn't implement was to check each "word" in
the body against a dictionary and only include valid words in the
MD5 calculation. This isn't nearly as CPU intensive as it may sound.
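Since I never built this, the following is only a sketch of how it might look. The function name dictionary_fingerprint is made up, the word list is passed in as a set (loading it from something like a system word file is left out), and splitting words on anything that is not a letter or digit is an assumption. Each lookup in a set takes constant time on average, which is why the cost stays low.

```python
import hashlib
import re


def dictionary_fingerprint(body: str, dictionary: set) -> str:
    """MD5 over only those words of the body found in `dictionary`.

    `dictionary` is expected to hold lowercase words.
    """
    # Break the lowercased body into runs of letters and digits.
    words = re.findall(r"[a-z0-9]+", body.lower())
    # Random tokens such as "xk39fj2" are not in the dictionary and drop out.
    valid = [w for w in words if w in dictionary]
    return hashlib.md5(" ".join(valid).encode("ascii")).hexdigest()
```

For example, with a dictionary containing "buy", "cheap" and "pills", the bodies "Buy cheap pills xk39fj2" and "buy CHEAP pills qz81" give the same hash.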