Marc Sherman (msherman@???) said, in message
<42383500.2050202@???>:
>
> > Our Bayesian filter could make use of the absence of the header but, of
> > course, if the header's not there, it's not a token that can have a
> > probability assigned to it.
>
> Wouldn't an unmodified bayesian filter still account for that by
> assigning a negative probability to the "Date: " token?
Argh!
You might be right, but I can't see the flaw in my reasoning. This is going
to mess with my mind!
The way I see it, the vast majority of both ham and spam has a Date: header.
So, the probability assigned to "Date:" will be roughly 50%, like other
common words. In which case, it's no real use. In fact, "H:Date" has
probability 0.463 for me (which presumably reflects the fact that the
presence of a Date: header is a slight indicator of ham), and it's pretty
doubtful it takes any part in any calculations at the moment.
In the unmodified Bayesian filter, even if the value assigned to "H:Date"
was high or low enough to actually end up contributing, it clearly can't be
used at all if there's no Date: header in the message. All it can do is
confer legitimacy on any message that has it. It doesn't say anything about
a message that's missing the Date: header.
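A minimal sketch of that point, assuming a tokenizer that emits one "H:<name>" token per header actually present (header_tokens and the two sample messages are made up for illustration, they aren't taken from the real filter):

```python
from email import message_from_string

def header_tokens(raw_message):
    """Return one 'H:<name>' token for each header present in the message.

    Hypothetical tokenizer: it can only emit tokens for headers it sees,
    so a missing header leaves no trace in the token set.
    """
    msg = message_from_string(raw_message)
    return {f"H:{name}" for name in msg.keys()}

# Two invented messages, identical except for the Date: header.
with_date = (
    "From: a@example.com\n"
    "Date: Mon, 14 Mar 2005 12:00:00 +0000\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
without_date = (
    "From: a@example.com\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)

tokens_with = header_tokens(with_date)
tokens_without = header_tokens(without_date)

print(sorted(tokens_with))
print(sorted(tokens_without))
```

For the second message there's simply no "H:Date" token to look up, so whatever probability was learned for it never enters the calculation.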
If my assumption (i.e. mails without a Date: header are mainly spam) is
right, then the token "M:Date" will be assigned a much higher probability
(i.e. Given that "Date:" is missing, what's the probability that the message
is spam?) and will actually start to be useful. This seems to be borne out by
the fact that "M:Date" has probability 0.8 with me at the moment.
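To put rough numbers on this, here's a sketch using a Graham-style per-token estimate with equal priors. The counts are invented for illustration (they are not from my real token database), and the function names are hypothetical; the real filter may well compute and weight things differently.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Estimate P(spam | token) from per-corpus frequencies, equal priors."""
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    return spam_rate / (spam_rate + ham_rate)

def interest(p):
    """Distance from the neutral value 0.5; larger means more informative."""
    return abs(p - 0.5)

# Invented corpus: 1000 spam and 1000 ham messages.
N_SPAM = 1000
N_HAM = 1000
SPAM_WITH_DATE = 960   # assumption: 4% of spam lacks a Date: header
HAM_WITH_DATE = 990    # assumption: 1% of ham lacks a Date: header

# "H:Date": token emitted when the header is present.
p_present = token_spam_probability(SPAM_WITH_DATE, HAM_WITH_DATE, N_SPAM, N_HAM)

# "M:Date": token emitted when the header is missing.
p_missing = token_spam_probability(
    N_SPAM - SPAM_WITH_DATE, N_HAM - HAM_WITH_DATE, N_SPAM, N_HAM
)

print(f"H:Date  p={p_present:.3f}  interest={interest(p_present):.3f}")
print(f"M:Date  p={p_missing:.3f}  interest={interest(p_missing):.3f}")
```

With these made-up counts "H:Date" lands just under 0.5 while "M:Date" sits well away from it, which is the asymmetry I'm describing: nearly every message has the header, so its presence says little, but its absence is rare enough to carry weight.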
Where am I going wrong?!
Cheers,
Alun.
--
Alun Jones auj@???
Systems Support, (01970) 62 2494
Information Services,
University of Wales, Aberystwyth