Re: [exim] Bayesian filtering idea

Author: Alun
Date:  
To: Marc Sherman
CC: exim-users
Subject: Re: [exim] Bayesian filtering idea
Marc Sherman (msherman@???) said, in message
    <42383500.2050202@???>:

>
> > Our Bayesian filter could make use of the absence of the header but, of
> > course, if the header's not there, it's not a token that can have a
> > probability assigned to it.
>
> Wouldn't an unmodified bayesian filter still account for that by
> assigning a negative probability to the "Date: " token?


Argh!

You might be right, but I can't see the flaw in my reasoning. This is going
to mess with my mind!

The way I see it, the vast majority of both ham and spam has a Date: header.
So, the probability assigned to "Date:" will be roughly 50%, like other
common words. In which case, it's no real use. In fact, "H:Date" has
probability 0.463 for me (which presumably reflects the fact that the
presence of a Date: header is a slight indicator of ham), and it's pretty
doubtful it takes any part in any calculations at the moment.

In the unmodified Bayesian filter, even if the value assigned to "H:Date"
was high or low enough to actually end up contributing, it clearly can't be
used at all if there's no Date: header in the message. All it can do is
confer legitimacy on any message that has it. It doesn't say anything about
a message that's missing the Date: header.
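
To make the point above concrete, here is a minimal sketch of the two tokenising behaviours. The "H:" and "M:" prefixes follow the token names used in this message; the function name, the `emit_missing` flag and the list of expected headers are hypothetical, made up for illustration, and are not taken from any real filter.

```python
# Hypothetical list of headers we expect every message to carry.
EXPECTED_HEADERS = ("Date", "From", "Message-ID")

def header_tokens(header_names, emit_missing=False):
    """Return header tokens for one message.

    header_names: names of the headers actually present in the message.
    emit_missing: False mimics the unmodified filter (presence tokens only);
                  True also emits an "M:" token for each expected header
                  that is absent.
    """
    present = {name.lower() for name in header_names}
    tokens = ["H:" + h for h in EXPECTED_HEADERS if h.lower() in present]
    if emit_missing:
        tokens += ["M:" + h for h in EXPECTED_HEADERS if h.lower() not in present]
    return tokens

# A message with no Date: header.
no_date = ["From", "Message-ID"]
print(header_tokens(no_date))                     # nothing about Date at all
print(header_tokens(no_date, emit_missing=True))  # now includes "M:Date"
```

With `emit_missing=False` the missing header leaves no trace in the token list, so the classifier has nothing to score; with `emit_missing=True` the absence becomes a token in its own right.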

If my assumption (i.e. mails without a Date: header are mainly spam) is
right, then the token "M:Date" will be assigned a much higher probability
(i.e. Given that "Date:" is missing, what's the probability that the message
is spam?) and will actually start to be useful. This seems to be borne out by
the fact that "M:Date" has probability 0.8 for me at the moment.
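
As a rough sketch of why the two tokens can end up with such different values, assume a simple per-token estimate: the token's rate in spam divided by the sum of its rates in spam and ham. That formula is an assumption here (real filters typically add clamping and weighting), and the corpus counts below are hypothetical, chosen only for illustration; they are not my actual statistics.

```python
def token_spam_probability(spam_hits, n_spam, ham_hits, n_ham):
    """Estimate P(spam | token) from per-corpus rates.

    Assumes the token occurs in at least one message, so the
    denominator is non-zero.
    """
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    return spam_rate / (spam_rate + ham_rate)

# Hypothetical corpus: 1000 spam and 1000 ham messages.
N_SPAM, N_HAM = 1000, 1000
SPAM_WITH_DATE, HAM_WITH_DATE = 900, 975  # messages carrying a Date: header

# "H:Date": present in nearly everything, so it lands close to 0.5.
p_h_date = token_spam_probability(SPAM_WITH_DATE, N_SPAM, HAM_WITH_DATE, N_HAM)

# "M:Date": counts are the complements, and the ratio is far more lopsided.
p_m_date = token_spam_probability(N_SPAM - SPAM_WITH_DATE, N_SPAM,
                                  N_HAM - HAM_WITH_DATE, N_HAM)

print(f"H:Date {p_h_date:.3f}")  # 0.480
print(f"M:Date {p_m_date:.3f}")  # 0.800
```

The same underlying counts produce a near-neutral "H:Date" and a strongly spammy "M:Date", because the presence rates (90% vs 97.5%) are nearly equal while the absence rates (10% vs 2.5%) differ by a factor of four.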

Where am I going wrong?!

Cheers,
Alun.

-- 
Alun Jones                       auj@???
Systems Support,                 (01970) 62 2494
Information Services,
University of Wales, Aberystwyth