[exim] Bayesian filtering idea

Author: Alun
Date:  
To: exim-users
Subject: [exim] Bayesian filtering idea

Dear all,

A simple idea I've just had. The concept might be in SpamAssassin
anyway, but I use a separate Bayesian filter on our mail system,
and thought it might be interesting...

We've recently noticed that a fair amount of mail comes in without a
Date: header [1]. My reading of RFC 2822 says it should be present, but I
don't want to risk blocking messages outright for lacking it. However, most
such messages from outside appear to be spam.

Our Bayesian filter could make use of the absence of the header but, of
course, if the header's not there, it's not a token that can have a
probability assigned to it.

So... I'm changing our Bayesian filter so that it adds tokens into the
database when a header is missing. Before tokenising a message, I initialise
an array of common mail headers. When I encounter one of these headers, I
delete it from the array. At the end of header processing, I therefore have
a list of headers that the message *doesn't* have. I then throw these into
the database with a special prefix ("M:Date" for instance).
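
As a rough sketch of the approach just described (not the actual filter
code), the following Python uses the standard library's email parser. The
function name, the EXPECTED_HEADERS list and the way the tokens are returned
are assumptions made for illustration; only the "M:" prefix and the two
headers come from the description above.

```python
from email import message_from_string

# Assumed starting list: only "Date" and "To" are tracked so far.
EXPECTED_HEADERS = ["Date", "To"]


def missing_header_tokens(raw_message, expected=EXPECTED_HEADERS, prefix="M:"):
    """Return one token per expected header that the message lacks."""
    msg = message_from_string(raw_message)
    # Start with every expected header, keyed case-insensitively.
    missing = {name.lower(): name for name in expected}
    # Delete each header that actually appears in the message.
    for name in msg.keys():
        missing.pop(name.lower(), None)
    # Whatever is left is absent; prefix it so it cannot clash with
    # ordinary tokens taken from the message itself.
    return [prefix + name for name in missing.values()]


raw = "From: a@example.com\nTo: b@example.com\nSubject: hi\n\nbody\n"
print(missing_header_tokens(raw))  # ['M:Date']
```

The returned tokens would then go through the same training and scoring
path as any other token.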

At the moment, I'm only using "Date:" and "To:", but right now I've got
a script running over all my recent mail to get a full list of common
headers. Throwing all of them in could be instructive, since the absence of
some headers will be good pointers that a message is ham rather than spam.

Anyway, current figures, gathered over the past 24 hours, training on the
results of our other anti-spam measures:

+--------+-----------+----------+-------------------+
| token  | spamcount | hamcount | prob              |
+--------+-----------+----------+-------------------+
| M:Date |       496 |      201 | 0.814702104203042 |
| M:To   |       183 |      156 | 0.676699069683468 |
+--------+-----------+----------+-------------------+

So, mail without a Date: header is 81% likely to be spam, as rated by
SpamAssassin et al. What's the chance that a large portion of the remainder
is spam that wasn't spotted?
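
The formula behind the prob column isn't stated here. The raw ratio
496 / (496 + 201) is only about 0.71, so the filter presumably weights each
count by the number of spam and ham messages it has been trained on. A
minimal sketch of one common way to do that follows; the function name and
the total_spam / total_ham parameters are assumptions, and the real filter
may well use a different or smoothed formula.

```python
def token_spam_prob(spam_count, ham_count, total_spam, total_ham):
    """Per-token spam probability, normalised by corpus sizes.

    total_spam and total_ham are the numbers of spam and ham messages
    trained on (assumed inputs; they are not given in this post).
    """
    if spam_count == 0 and ham_count == 0:
        return 0.5  # token never seen: no evidence either way
    spam_freq = spam_count / total_spam
    ham_freq = ham_count / total_ham
    return spam_freq / (spam_freq + ham_freq)


# With equal corpus sizes this reduces to the raw ratio.
print(token_spam_prob(496, 201, 1000, 1000))
```

With more ham than spam in the training set, the same counts give a higher
spam probability than the raw ratio, which is the direction of the
difference seen in the table.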

Cheers,
Alun.

[1] Mainly because we're handling mail generated by a large, expensive
piece of software that doesn't add it. Its authors are clueless: they won't
accept that Date: should be there, especially on mail as time-sensitive as
their system generates, and they blame the client software when it can't
sort these undated mails into date order. Sigh.