Author: Marc Sherman
Date:
To: Alun
CC: exim-users
Subject: Re: [exim] Bayesian filtering idea
Alun wrote:
>
> In my bayesian table, it has probability exactly 0.5.
>
> Now, turn it on its head. Given that a message doesn't contain the word
> "the", what's its probability of being spam? Not 0.5, I'm willing to bet.
>
> What's the use of this? Well... my existing table tells me the words that
> have probability 0.5 and high hit rates ("you", "if", "your", "and", "of",
> "is", "will", "have", "it", "right"). So I've got exactly the data that I
> need to generate the inverse list, and I can maintain this dynamically.
> Suddenly the absence of these words might start to be discriminating. Would
> that be a pain for the spammers? I don't know.

Wow. The more I think about this idea, the more I like it.

The Bayesian filter already knows what the very common tokens are, so it
can automatically decide which tokens it wants to track "missing"
counts/probabilities for.
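
To make that concrete, here is a rough sketch in Python of how a "missing"
probability could fall out of counts the filter already keeps. This is not
SpamAssassin code: the function name, the frequency-based estimate, and the
message counts are all assumptions made up for illustration.

```python
def presence_and_absence_prob(spam_with, ham_with, n_spam, n_ham):
    """Return (P(spam | token present), P(spam | token absent)).

    Hypothetical helper. Works on per-corpus frequencies so that unequal
    spam/ham corpus sizes do not skew the estimate. Assumes n_spam and
    n_ham are both greater than zero.
    """
    def ratio(spam_rate, ham_rate):
        total = spam_rate + ham_rate
        # No evidence either way: fall back to a neutral 0.5.
        return 0.5 if total == 0 else spam_rate / total

    spam_rate = spam_with / n_spam   # fraction of spam containing the token
    ham_rate = ham_with / n_ham      # fraction of ham containing the token

    present = ratio(spam_rate, ham_rate)
    # Messages without the token are simply the complement of each rate.
    absent = ratio(1.0 - spam_rate, 1.0 - ham_rate)
    return present, absent


# Made-up counts: token seen in 900 of 1000 spam and 950 of 1000 ham messages.
present, absent = presence_and_absence_prob(900, 950, 1000, 1000)
print(round(present, 3), round(absent, 3))
```

With those invented counts the token scores close to 0.5 when present but
about 0.67 when absent, and nothing new has to be stored: the "missing"
counts are just the corpus totals minus the existing hit counts.
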

You don't happen to be a Perl hacker, do you? I bet this would be easy
to hack into SpamAssassin's Bayes filter.