[exim] New spam filtering trick - possible breakthrough in s…

Author: Marc Perkel
Date:  
To: dev, exim-users
CC: 
Subject: [exim] New spam filtering trick - possible breakthrough in spam filtering
OK - so - I've been talking about this for over a year - but I finally
found a way to try it out and - IT WORKS!

I'm running a second bayesian filter - using spamprobe - but I'm not
feeding the entire message into it. I'm only feeding it the headers - not
the body of the message. I do scan the body, but only to pull out any
links and email addresses, and I feed those in too. I will probably
include phone numbers as well - but that's all. I am also using Exim to
"enhance" the headers. I'm looking up DNS information on the sender's
domain - their MX servers, the zone information on the connecting IP
address and a few other things - so the headers have a lot more info
to work with.
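
To make the "enhanced headers" idea concrete, here is a minimal sketch
of the shape of that step. It does no DNS lookups itself; it assumes
the lookups have already been done (in my setup that happens in Exim)
and only shows the results being prepended as extra header lines. The
function name and the X-Enhance-* header names are hypothetical, made
up for illustration.

```python
def enhance_headers(raw_message, lookups):
    """Prepend one extra header line per looked-up fact.

    `lookups` maps a hypothetical label (e.g. "Sender-MX") to a list of
    values obtained elsewhere. The X-Enhance-* names are illustrative
    only, not headers that Exim or any filter defines.
    """
    extra = []
    for name, values in lookups.items():
        for value in values:
            extra.append(f"X-Enhance-{name}: {value}")
    if not extra:
        return raw_message
    # New header lines go in front of the existing header block.
    return "\n".join(extra) + "\n" + raw_message


if __name__ == "__main__":
    message = "Subject: hi\n\nbody\n"
    print(enhance_headers(message, {"Sender-MX": ["mx1.example.net"]}))
```

The point is just that every looked-up fact becomes one more token in
the headers for the bayesian filter to learn from.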

And - it's working EXTREMELY WELL.

Why - you might ask - does it work better with less information?

Different parts of the message are spammier than other parts. The
spammiest parts are the message headers, especially the subject line,
and the URLs that the message links to. Generally spam isn't sent
the same way that ham is sent and the bayesian filter can catch that. So
what I'm doing is only looking at the hottest parts of the email and
disregarding most of the body.

One of the immediate advantages is that messages that contain random
text to confuse bayesian filters have no effect on this one. And if
someone gets a spam and forwards it to me - it's not going to score very
high. And it works so well that the rest of you developers should really
look into this and do it right.

So - you may ask - how did I implement this?

I'm using Exim and SpamAssassin and using Spamprobe as the second
bayesian database. Spamprobe is simple to implement and interface. What
I do is take the messages coming out of SpamAssassin and look for the
autolearn tags so that Spamprobe is trained on the same messages that
SpamAssassin is trained on. I have IMAP feedback folders as well so
users can drag spam into spam-missed folders and I pick those up and
train SA and Spamprobe on these as well.
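
Picking out the autolearn tag might look roughly like this. It's only a
sketch - it assumes SpamAssassin has written an X-Spam-Status header
containing autolearn=spam or autolearn=ham, and the function name
training_label is made up. How the label is then handed to spamprobe is
not shown.

```python
import re
from email import message_from_string

# Matches the autolearn=<value> field inside X-Spam-Status.
AUTOLEARN_RE = re.compile(r"autolearn=(\w+)")


def training_label(raw_message):
    """Return "spam", "ham", or None depending on the autolearn tag.

    None means SpamAssassin did not autolearn from this message (for
    example autolearn=no, or the header is missing), so the second
    filter should not be trained on it either.
    """
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    match = AUTOLEARN_RE.search(status)
    if match is None:
        return None
    value = match.group(1).lower()
    return value if value in ("spam", "ham") else None


if __name__ == "__main__":
    sample = (
        "X-Spam-Status: Yes, score=15.2 required=5.0\n"
        "\tautolearn=spam\n"
        "Subject: test\n\nbody\n"
    )
    print(training_label(sample))
```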

Messages going into Spamprobe are first run through a perl script that
removes the message body except for email addresses and links. So
spamprobe is trained on the same messages - but only part of the message.
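
My real script is perl, but the body-stripping step can be sketched in
Python along these lines. This is an illustration, not the actual
script: the regular expressions are deliberately simple, the function
name reduce_message is made up, and it only looks at text parts.

```python
import re
from email import message_from_string

# Deliberately simple patterns - good enough for a sketch.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def reduce_message(raw_message):
    """Keep the headers, replace the body with its links and addresses."""
    msg = message_from_string(raw_message)

    # Gather the decoded text of every text/* part.
    chunks = []
    for part in msg.walk():
        if part.is_multipart() or part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        if payload is not None:
            chunks.append(payload.decode("utf-8", errors="replace"))
    body = "\n".join(chunks)

    # Links first, then addresses, duplicates dropped, order preserved.
    tokens = list(dict.fromkeys(URL_RE.findall(body) + EMAIL_RE.findall(body)))

    headers = "".join(f"{name}: {value}\n" for name, value in msg.items())
    reduced_body = "".join(token + "\n" for token in tokens)
    return headers + "\n" + reduced_body


if __name__ == "__main__":
    sample = (
        "From: a@example.com\n"
        "Subject: Cheap pills\n"
        "\n"
        "Buy now at http://spam.example/buy and random filler text.\n"
    )
    print(reduce_message(sample))
```

Random filler text in the body never reaches the filter, which is why
the word-salad trick has nothing to work with.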

New email coming in is first tested with spamprobe to see how it scores.
Again - only the headers and links are tested. Spamprobe returns a
number between 0 and 1 with 0 = ham and 1 = spam. I pipe the result into
another perl script that returns a header with one of 9 different words
describing the result. Counting the score as a percentage, the middle
50% (25-75) is neutral. The next 15% on each side (10-25 and 75-90) is
low. The next 9% (1-10 and 90-99) is high. The next 0.9% (0.1-1 and
99-99.9) is very high - and the last 0.1% at each end is extreme.
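
The banding can be sketched like this. It's a Python illustration
rather than my perl script. The SP_* names match the SpamAssassin rule
names I use; SP_NEUTRAL and the function name bucket are made up here,
and which band a score sitting exactly on a boundary falls into is an
arbitrary choice.

```python
def bucket(score):
    """Map a score in [0, 1] (0 = ham, 1 = spam) to one of nine words."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")

    # Distance from the nearest end of the scale decides the band.
    tail = min(score, 1.0 - score)
    if tail >= 0.25:
        return "SP_NEUTRAL"  # middle 50% (hypothetical name)

    side = "SPAM" if score > 0.5 else "HAM"
    if tail >= 0.10:
        strength = "LOW"      # next 15% on each side
    elif tail >= 0.01:
        strength = "HIGH"     # next 9%
    elif tail >= 0.001:
        strength = "VERY"     # next 0.9%
    else:
        strength = "EXTREME"  # last 0.1%
    return f"SP_{side}_{strength}"


if __name__ == "__main__":
    for value in (0.0005, 0.05, 0.2, 0.5, 0.8, 0.95, 0.995, 0.9999):
        print(value, bucket(value))
```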

These words are added through a header on the way into SpamAssassin and
SpamAssassin scores them. I've assigned scores as follows:

score SP_HAM_EXTREME            -8
score SP_HAM_VERY               -5
score SP_HAM_HIGH               -2
score SP_HAM_LOW                -1
score SP_SPAM_LOW               1 
score SP_SPAM_HIGH              2
score SP_SPAM_VERY              5 
score SP_SPAM_EXTREME           8


What I am seeing is that although SpamAssassin's bayesian filter is
pretty good - it's not as good as the spamprobe filter when it's fed
with only the hottest part of the message. The way I cobbled this
together probably isn't the best way to do it - but it is good enough to
show me that the concept does work. It is working so well that my
overall accuracy - which includes all the tricks I'm using - is now
almost 100%.

I'm still tweaking this but I am happy to share with anyone interested
what I'm doing and how I'm doing it. And I want to encourage everyone to
look into this idea of using partial messages in bayesian filtering. I'm
running both filters now. I don't know yet if both are necessary in the
long run. I like the idea of two filters looking at different data. It
makes me wonder about having multiple filters all looking at different
parts of the messages independently and then scoring them all separately.