Re: [exim] Bayesian filtering idea

Author: Alun
Date:  
To: exim-users
Subject: Re: [exim] Bayesian filtering idea
Me (auj@???) said, in message
    <E1DBc5k-000086-P1@???>:

>
> Now, turn it on its head. Given that a message doesn't contain the word
> "the", what's its probability of being spam? Not 0.5, I'm willing to bet.


[...]

> Suddenly the absence of these words might start to be discriminating. Would
> that be a pain for the spammers? I don't know.


Well, bang goes that idea. I left it collecting stats overnight, and the
results aren't all that convincing (see the "Missing Words" table at the
bottom). The probabilities *do* move away from 0.5, but not far enough to
make them worth using. I'll probably tinker with it some more just for fun,
but I doubt it's really worth it.
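
For anyone wanting to play with the same idea, here is a rough sketch of how a "probability of spam given that a token is absent" could be computed from counts like the ones in the tables below. It uses a plain Graham-style ratio, which is an assumption: the exact formula and the corpus totals behind the posted figures aren't stated, so this will not reproduce those numbers. The function name and the `total_spam` / `total_ham` parameters are hypothetical.

```python
def absent_token_spam_prob(spam_missing, ham_missing, total_spam, total_ham):
    """Estimate P(spam | token is absent) from message counts.

    spam_missing / ham_missing: spam and ham messages lacking the token.
    total_spam / total_ham: corpus sizes (hypothetical inputs; the original
    post does not give them).
    """
    if total_spam <= 0 or total_ham <= 0:
        raise ValueError("corpus totals must be positive")

    # Fraction of each corpus in which the token is missing.
    spam_rate = spam_missing / total_spam
    ham_rate = ham_missing / total_ham

    # No evidence either way: fall back to the neutral value.
    if spam_rate + ham_rate == 0:
        return 0.5

    return spam_rate / (spam_rate + ham_rate)


# Made-up numbers: the token is missing from 30 of 100 spams and 10 of 100 hams.
print(absent_token_spam_prob(30, 10, 100, 100))
```

A value near 0.5 means the absence of the token tells you very little, which is roughly what the common-word figures show.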

The "missing headers" check has turned up a few high and low probabilities,
and won't penalise non-English-language e-mails, so I think I'll leave
that in.
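
As an illustration only, a missing-headers check could be sketched along these lines using Python's standard `email` module. This is not the code that produced the stats below; the `missing-header:` token prefix, the function name, and the header list are all assumptions made for the example.

```python
from email import message_from_string

# Headers to look for; chosen to match the table rows, otherwise arbitrary.
EXPECTED_HEADERS = ("from", "to", "date", "subject", "mime-version", "content-type")


def missing_header_tokens(raw_message, expected=EXPECTED_HEADERS):
    """Return one pseudo-token per expected header that the message lacks."""
    msg = message_from_string(raw_message)
    # Header lookup is case-insensitive and yields None when the header is absent.
    return [f"missing-header:{name}" for name in expected if msg[name] is None]


raw = "From: a@example.com\nTo: b@example.com\nSubject: hi\n\nbody\n"
print(missing_header_tokens(raw))
```

The resulting pseudo-tokens can then be counted per spam and ham message just like ordinary word tokens.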

Cheers,
Alun.

Missing Headers:

+--------------+-----------+----------+--------------------+
| token        | spamcount | hamcount | prob               |
+--------------+-----------+----------+--------------------+
| from         |         9 |        1 |  0.934363660930122 |
| to           |       142 |      127 |  0.663827713807799 |
| date         |       441 |      402 |  0.661700664815925 |
[...]
| mime-version |       222 |     1694 |   0.18833103805181 |
| subject      |         5 |       73 |  0.109939934004354 |
| content-type |        64 |     1163 | 0.0878777698327971 |
+--------------+-----------+----------+--------------------+

[
Interesting. I'd have tended to guess that mails without a
Subject: were *more* likely to be spam, not less.
]

Missing Words:

+-----------+-----------+----------+-------------------+
| token     | spamcount | hamcount | prob              |
+-----------+-----------+----------+-------------------+
| the       |       661 |      801 | 0.595071978782059 |
| to        |       573 |      864 | 0.541479935056322 |
| in        |       878 |     1438 | 0.520779185129536 |
| of        |       810 |     1356 | 0.515322312740704 |
| is        |       889 |     1755 |  0.47409172380727 |
| and       |       685 |     1538 | 0.442175392032512 |
| on        |      1016 |     2373 | 0.432409231450976 |
| at        |      1389 |     3248 | 0.431840443797733 |
| for       |       824 |     1957 | 0.428333082213174 |
| with      |      1200 |     3305 | 0.392189229737764 |
| this      |       447 |     1262 | 0.386684396095635 |
| your      |       906 |     2585 | 0.384090301781086 |
| have      |      1223 |     3599 | 0.376779701925816 |
| from      |       892 |     2671 |  0.37272674712802 |
| by        |      1294 |     3914 | 0.370345427303499 |
| if        |       968 |     3078 | 0.358783650853309 |
| that      |      1201 |     3894 | 0.354299175300879 |
| right     |      1651 |     5399 | 0.352148392217373 |
| value     |      2008 |     6675 | 0.348441079108702 |
| we        |      1440 |     4836 | 0.346081131964434 |
| will      |      1472 |     4994 | 0.343786515116099 |
| or        |      1002 |     3431 | 0.341924137342655 |
| as        |      1357 |     4815 | 0.333728955897972 |
| all       |      1267 |     4612 | 0.328060595784779 |
| you       |       428 |     1609 | 0.321302533127908 |
| job       |      2042 |     7757 |  0.31879254543179 |
| not       |      1207 |     4593 | 0.318343504090853 |
| our       |      1114 |     4278 | 0.316595324425598 |
| please    |       956 |     3677 | 0.316264840237378 |
| cgi       |      2011 |     7800 | 0.314285497702035 |
| be        |      1110 |     4338 | 0.312571355216773 |
| it        |      1287 |     5084 | 0.310296762099719 |
| are       |      1020 |     4074 | 0.308157722607593 |
| size      |      1119 |     4837 | 0.291559780094477 |
| php       |      1655 |     7805 | 0.273738174638698 |
| email     |       912 |     4313 | 0.273086483253055 |
| mail      |      1351 |     7348 | 0.246290714573663 |
+-----------+-----------+----------+-------------------+

-- 
Alun Jones                       auj@???
Systems Support,                 (01970) 62 2494
Information Services,
University of Wales, Aberystwyth