[ On Friday, February 4, 2005 at 18:55:05 (-0500), Greg Louis wrote: ]
> Subject: Re: [exim] Best bogofilter integration (for after-SMTP time)?
>
> Depends very greatly on what constitutes "very well" in your
> environment. If you need fewer than one in a myriad false positives
> and half a percent false negatives (I'm getting figures like that with
> 80 quite disparate users and a single training database, and
> preliminary trials at York U. in Toronto suggest that 60,000 mail
> accounts can be handled almost as well; see also
> http://jblosser.firinn.org/bogopaper ), then you may come to feel that
> bogofilter isn't working "very well" for you.
Well, I know from about a year's worth of daily experience with upwards
of a thousand messages per day that my own bogofilter instance does not
work very well (i.e. gets exponentially less accurate -- or nearly so)
unless I re-train it as soon as possible. I don't know that it's ever
given any significant false positives per se, but I've always been
pretty conservative with it and I now use ham_cutoff=0.50 and
spam_cutoff=0.90. Then again I'm the typical geek/dweeb/freak user with
very predictable e-mail usage patterns. :-)
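To make the effect of those two cutoffs concrete, here is a minimal Python sketch of the three-way split they produce. The function name `classify` is hypothetical. The comparison rule (Spam at or above spam_cutoff, Ham at or below ham_cutoff, Unsure in between) is an assumption based on how bogofilter's documentation describes these options; it is not taken from bogofilter's code.

```python
HAM_CUTOFF = 0.50   # ham_cutoff value quoted in the message
SPAM_CUTOFF = 0.90  # spam_cutoff value quoted in the message


def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spamicity score in [0, 1] to 'Ham', 'Unsure', or 'Spam'.

    Assumed rule: score >= spam_cutoff is Spam, score <= ham_cutoff is Ham,
    and anything strictly between the two cutoffs is Unsure.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= spam_cutoff:
        return "Spam"
    if score <= ham_cutoff:
        return "Ham"
    return "Unsure"
```

With these settings the Unsure band covers scores strictly between 0.50 and 0.90, so a conservative setup like this one sends borderline mail to Unsure rather than to Spam.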
One thing I know for certain though -- I would not ever want any other
user to be able to retrain my bogofilter for me. I've tried some
experiments with using other people's mail for re-training and at best
the result is a drastic reduction in accuracy (i.e. a dramatic climb in
"Unsure" results). Admittedly the false positive rate didn't increase,
but then those whose e-mail archives I have access to tend to have
usage patterns similar to mine and just as predictable (though not exactly
matching my own). That's also assuming no other user has malicious
intent. I think any attempt at centralized word/token list maintenance
requires either a great deal of trust or even a moderator -- and neither
is available or even possible in any typical ISP environment.
(DSpam has centralized re-training hooks as well of course, but they
appear to be quite advanced in terms of being sensitive to user-specific
attributes -- or at least that's how I'm hoping they work based on what
I've read so far.)
Luckily when statistical analyzers are built into MUAs (e.g. as I have
bogofilter hooked into Emacs VM) they do inherently have private
word/token lists and they also have _MUCH_ better retraining interfaces.
That benefit alone is a very strong argument against using any filter
that requires training and retraining in a centralized fashion. Every
time I've talked to anyone (users, ISPs, postmasters) about any kind of
retraining hooks to a centralized filter of any kind, their eyes glaze
over before I'm past the first step.
One thing I don't like about bogofilter, even for personal use, is that
it's getting pretty bloaty and expensive to run with recent releases.
I've had to pump my db_cachesize up to 128MB to get it to run at any
decent speed while filtering a big chunk of new mail and it still takes
a very _long_ time to feed one message through it again for retraining.
Most of the overhead seems to be in I/O to the wordlist DB. That's with
0.92.6 and BerkeleyDB 4.2.52 -- perhaps there's a lighter-weight,
lower-I/O demanding alternative for large-scale use? I also note that
the large-scale deployments mentioned in the jblosser paper you point to
were done with much older releases. I do very much like the idea of
running the data through a central filter at SMTP-time so that unwanted
mail can be immediately rejected, but I know I wouldn't be able to
afford to do that for any sizable throughput using bogofilter in the way
I'm using it now, particularly not if I had to start it once for each
incoming connection (as a running daemon it might suffice).
I.e. the claimed I/O profile of DSpam is another major attraction
-- anything to reduce the I/O on a mail server, or at least not add to
it (much), is a good thing! ;-)
(BTW, is anyone working on fixing bogofilter so that it can properly
parse modern *BSD-style mbox files -- i.e. the kind where all messages
are separated by a blank line and start with a "From " header and all
following headers are proper RFC-[2]822 syntax, and thus which do not
have any of the unnecessary ">From " stuffing? The current folder parser
is still stuck in the 1970's. ;-)
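The separator rule described above can be sketched in Python roughly as follows. This is only an illustration, not bogofilter's parser and not a proposed patch: the names `split_mbox` and `HEADER_RE` are hypothetical, and the header test is a simple heuristic (the line after "From " must look like an RFC 822 "Name:" field), so a body paragraph that happens to match that shape would still be mis-split.

```python
import re

# Heuristic: an RFC 822 field name (printable ASCII, no space or colon)
# followed by a colon.
HEADER_RE = re.compile(r"^[!-9;-~]+:")


def split_mbox(text):
    """Split mbox text into a list of message strings.

    A line starting with "From " is treated as a separator only if it is
    the first line of the file or follows a blank line, AND the next line
    looks like a mail header. Unstuffed "From " lines inside a message
    body are therefore left alone in the common case.
    """
    lines = text.splitlines(keepends=True)
    starts = []
    for i, line in enumerate(lines):
        after_blank = i == 0 or lines[i - 1].strip("\r\n") == ""
        header_next = i + 1 < len(lines) and bool(HEADER_RE.match(lines[i + 1]))
        if line.startswith("From ") and after_blank and header_next:
            starts.append(i)
    # Each message runs from its separator line up to the next separator.
    ends = starts[1:] + [len(lines)]
    return ["".join(lines[a:b]) for a, b in zip(starts, ends)]
```

Checking both the preceding blank line and the following header line is what makes the ">From " stuffing unnecessary in most real mail, at the cost of a one-line lookahead.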
--
Greg A. Woods
H:+1 416 218-0098 W:+1 416 489-5852 x122 VE3TCP RoboHack <woods@???>
Planix, Inc. <woods@???> Secrets of the Weird <woods@???>