Me (auj@???) said, in message
<E1DBXuB-000GjS-Uq@???>:
>
> Our Bayesian filter could make use of the absence of the header but, of
> course, if the header's not there, it's not a token that can have a
> probability assigned to it.
OK, here's one step further.
Imagine an extremely common word. Take "the".
Well, it's equally common in spam and not spam, and so is very
undiscriminating, yes?
In my Bayesian table, it has probability exactly 0.5.
Now, turn it on its head. Given that a message doesn't contain the word
"the", what's its probability of being spam? Not 0.5, I'm willing to bet.
What's the use of this? Well... my existing table tells me the words that
have probability 0.5 and high hit rates ("you", "if", "your", "and", "of",
"is", "will", "have", "it", "right"). So I've got exactly the data that I
need to generate the inverse list, and I can maintain this dynamically.
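A rough sketch of how that inverse list could be pulled out of the table, in Python. The table layout (token mapped to a probability and a hit count), the function name and the thresholds are all assumptions for illustration; the real table format isn't shown in this message.

```python
def absence_candidates(table, tolerance=0.05, min_hits=1000):
    """Pick tokens that are common but tell us nothing when present.

    table maps token -> (spam_probability, hit_count). This layout is a
    hypothetical stand-in for whatever the real filter stores.
    """
    return sorted(
        token
        for token, (prob, hits) in table.items()
        if abs(prob - 0.5) <= tolerance and hits >= min_hits
    )


# Made-up sample data, not real filter statistics.
sample_table = {
    "the": (0.50, 5000),
    "you": (0.52, 4000),
    "viagra": (0.99, 800),
    "zyzzyva": (0.50, 2),
}
print(absence_candidates(sample_table))  # ['the', 'you']
```

Rerunning something like this whenever the table is updated would be one simple way to keep the list current.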
Suddenly the absence of these words might start to be discriminating. Would
that be a pain for the spammers? I don't know.
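Whether absence really is discriminating can be estimated from counts. The sketch below is Python and assumes the filter can say how many spam and ham messages contain a word at all (per-message counts), and that spam and ham are treated as equally likely beforehand. The function name and the numbers are invented for illustration.

```python
def p_spam_given_absent(spam_with, spam_total, ham_with, ham_total):
    """Estimate P(spam | word absent) from per-message counts.

    spam_with / ham_with: messages of each kind that contain the word.
    spam_total / ham_total: all messages of each kind (must be > 0).
    Assumes equal prior probability for spam and ham.
    """
    spam_absent_rate = (spam_total - spam_with) / spam_total
    ham_absent_rate = (ham_total - ham_with) / ham_total
    denom = spam_absent_rate + ham_absent_rate
    if denom == 0:
        # The word is in every message seen so far, so there is no
        # evidence either way about its absence.
        return 0.5
    return spam_absent_rate / denom


# Hypothetical counts: the word is in 70% of spam and 98% of ham.
print(p_spam_given_absent(700, 1000, 980, 1000))  # roughly 0.94
```

Under this estimate, absence only carries information when the fraction of spam containing the word differs from the fraction of ham containing it. That can happen even when the table probability is 0.5, if that figure is built from occurrence counts rather than message counts; if the two fractions are identical, absence scores 0.5 too.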
I'd like to imagine that it hits them both ways. If they leave the common
words out to try to avoid giving a Bayesian filter anything to play with (I
certainly get lots of spam with a few random words that are perceived to be
low scoring, and a URL), the lack of these common words starts to raise the
probabilities. If they stack messages full of the common words to avoid
failing the "missing word" test, those words might become spammy, but that
won't matter for legitimate mails because these tend to have lots of other
hammy words to water them down.
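One way a "missing word" test could be fed to a Bayesian filter is to turn each absence into a synthetic token, so that it can be given a probability like any other. This is only a sketch in Python: the "absent:" prefix, the function name and the crude tokeniser are assumptions, not how any particular filter does it.

```python
import re


def absence_tokens(message_text, watched_words):
    """Return one synthetic token per watched word missing from the text.

    The tokeniser is deliberately simple (lower-cased runs of letters and
    apostrophes); a real filter would reuse its own tokeniser instead.
    """
    present = set(re.findall(r"[a-z']+", message_text.lower()))
    return ["absent:" + word for word in watched_words if word not in present]


watched = ["the", "you", "and"]

# A short spam-like message: a few words and a URL, no common words.
print(absence_tokens("Buy cheap pills http://example.com", watched))
# ['absent:the', 'absent:you', 'absent:and']

# An ordinary sentence produces no absence tokens at all.
print(absence_tokens("Can you and the team meet tomorrow?", watched))
# []
```

The synthetic tokens would then be trained and scored alongside the ordinary ones, so their probabilities drift with the mail stream in the same way.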
I'm going to run it for a while with the words above and see what happens.
It might all be sunk by foreign-language e-mails...
Andrew Rawlins, the other half of my team, suggested another useful one
that's not picked up by standard Bayesian tokenisation: the signature
separator, "\n-- \n". A message that has it currently appears to be
very unlikely to be spam (spamcount=0, hamcount=36 in the past few minutes).
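Detecting the separator is straightforward once the body is available as a string. A minimal sketch in Python, assuming the body has already been decoded to text; the function name is made up, and normalising CRLF line endings is an extra assumption about how the mail arrives.

```python
def has_signature_separator(body):
    """True if the body contains a line that is exactly "-- ".

    Line endings are normalised first so that "\r\n-- \r\n" also matches.
    The trailing space matters: a bare "--" line does not count.
    """
    text = body.replace("\r\n", "\n")
    return text.startswith("-- \n") or "\n-- \n" in text


print(has_signature_separator("Cheers,\nAlun.\n-- \nAlun Jones\n"))  # True
print(has_signature_separator("Cheers,\nAlun.\n--\nAlun Jones\n"))   # False
```

The result could be handed to the filter as one more synthetic token, present or not, and left to earn its own probability.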
Cheers,
Alun.
--
Alun Jones auj@???
Systems Support, (01970) 62 2494
Information Services,
University of Wales, Aberystwyth