Re: [exim] Best bogofilter integration (for after-SMTP time)…

Author: Greg Louis
Date:  
To: Exim User List
Subject: Re: [exim] Best bogofilter integration (for after-SMTP time)?
On 20050204 (Fri) at 1803:16 -0500, Greg A. Woods wrote:
> [ On Friday, February 4, 2005 at 02:04:12 (+0100), Axel Thimm wrote: ]
> > Subject: [exim] Best bogofilter integration (for after-SMTP time)?
> >
> > I've seen two ways of integrating bogofilter:
> >
> > o as a transport_filter for (almost) each transport
> > o as a transport of its own resubmitting the mail into exim's queue
> >
> > Both methods are kind of hackish.
>
> Neither method will work very well either unless maybe all the users get
> very similar mail all of the time.


That depends very much on what constitutes "very well" in your
environment. I'm getting roughly one false positive in ten thousand
and half a percent false negatives with 80 quite disparate users and a
single training database, and preliminary trials at York U. in Toronto
suggest that 60,000 mail accounts can be handled almost as well (see
also http://jblosser.firinn.org/bogopaper ). If you need error rates
lower than that, then you may come to feel that bogofilter isn't
working "very well" for you.
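
To put those rates in concrete terms, here is a rough Python sketch of
the arithmetic. The function name and the message volumes are made up
for illustration and are not measurements from any site mentioned here;
it also assumes the false-positive rate applies to nonspam messages and
the false-negative rate to spam.

```python
def expected_errors(ham, spam, fp_rate=1 / 10_000, fn_rate=0.005):
    """Return (false_positives, false_negatives) expected for the given volumes.

    fp_rate: fraction of nonspam (ham) wrongly classified as spam.
    fn_rate: fraction of spam wrongly classified as nonspam.
    The defaults are the approximate rates quoted in this message.
    """
    return ham * fp_rate, spam * fn_rate


if __name__ == "__main__":
    # Hypothetical volumes, chosen only to make the arithmetic visible.
    fp, fn = expected_errors(ham=20_000, spam=80_000)
    print(f"expected false positives: {fp:.1f}")
    print(f"expected false negatives: {fn:.1f}")
```

Plugging in your own site's volumes gives a quick feel for whether
those rates would be acceptable in your environment.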

(BTW the false positives I see are almost exclusively things like hotel
or airline reservation confirmations with embedded marketing drivel, or
newsletters that read like advertisements -- stuff any filter is likely
to have trouble with.)

> Filters that use statistical analysis and word or token lists only work
> really well if they can be targeted to individual users so that each
> recipient mailbox has its own private token list, and so that each user
> has a direct and easy way to re-train it when it makes a mistake. Note
> that Bogofilter in particular can quickly get out of control if it
> starts to make mistakes and isn't retrained properly.


The first of those two sentences depends on what constitutes "really
well" -- many of the habitués of the bogofilter mailing lists seem
happy with the level of accuracy bogofilter achieves for them in
single-database environments.

> You might consider using DSpam instead. It does even more than
> bogofilter in terms of analysis and claims a much lower error rate even
> without retraining (it uses CRM114), and it was designed to work well
> and properly at the LDA level so that each user has a private wordlist
> (i.e. a more accurate and private statistical model of their normal
> e-mail traffic).
>
>     http://www.nuclearelephant.com/projects/dspam/


I've certainly no quarrel with the suggestion of checking out DSpam,
but that "more accurate and private statistical model" claim is based
upon what may be a false assumption, namely, that every individual user
receives such a volume of spam and nonspam email that enough messages
to form an accurate model can readily be accumulated.

--
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |