Author: W B Hacker Date: To: exim users Subject: Re: [exim] MIME parts and sa-learn
Daniel Tiefnig wrote:
> Hej,
>
> I am thinking of a clever way to integrate spamassassin's sa-learn
> (bayesian classifier training program) into exim's ACLs. The intended
> approach is to pass the message which should be trained as a
> "message/rfc822" attachment (so original headers are preserved) to a
> specific address (e.g. sa-learn-spam@domain) at the server.
>
> Therefore, the first thing I was looking at was the smtp_mime ACL, but
> it doesn't seem to be of much use besides filtering for regular
> expressions. If the "malware" condition would be allowed, I could pass
> the attachment via a "cmdline" scanner to sa-learn, but according to the
> docs this isn't possible.
>
> It is of course possible to pass the whole message to a script in the
> data ACL or in a transport and demime everything in the script, but I
> don't really like that much.
> Another approach would be to demime into unique files in the mime ACL
> and read these files in a scanner/delivery script, but that's even
> worse, IMO.
>
> I'm sure people are using spamassassin a lot out there, so can anyone
> here show a smarter way of integrating spam/ham learning? (Without using
> spamc or sth. else from the user side.)
>
> TIA & br,
> daniel
>
CAVEAT: Take this as a 'contrarian' observation w/r/t auto-learning and local
server spam/ham classifying in general.
- IMNSHO, trying to 'learn' spam/ham discrimination on a mixed-user server has
two drawbacks:
-- It uses a great deal of machine resources compared to a multitude of simpler
and more repeatable/predictable means of filtering.
-- It can be confused by per-user differences, both in what one user
considers spam and another does not (quasi-legit adverts, supermarket, bookstore,
airline and travel 'bargains' etc), and in the very nature of the traffic different
users expect (active in social networks, retired vs active business contacts,
family & friends vs professionals, et al).
So Spam-Bayes and friends can easily get it 'wrong' if applied system-wide, yet
may need even greater resources if they are to be applied per-recipient - not
easily done in the requisite DATA phase anyway - at least not as to rejection vs
mere demerit scoring.
Conversely, Bayesian filtering seems to be at its best when applied in the
end-user's MUA, where it is always 'per-recipient' specific, AND has at
least 'momentary' access to a generally greater chunk of processing power than a
server might be able to spare at busy times.
Next is the general 'need' to reinvent the classification anyway. It might have
a better payoff to utilize SA for all EXCEPT Bayesian / 'learning'/ AWL, and
add, for example, DSPAM, wherein a broader global dataset of spam vs ham
'fingerprints' can be applied with less total effort than developing your own on
the fly.
Either way, our experience has been that there is more than enough information
available to identify the unwanted so as to not need either SA's Spam-Bayes or
DSPAM.
Messages that evade interception by simpler means are few enough to not justify
the extra complexity - and maintenance - otherwise required, even when plentiful
machine resources are on-tap.
Note the relatively modest scores SA assigns by default even when SA-Spam-Bayes
is used. Not really in the front lines of defense - though one can, of course,
make it such.
Finally, to the extent that all other filters are working well, AND rejecting
in-session, not just scoring and onpassing, there can be a scarcity of spam on
which to train Bayesian filtering. Carrying such traffic 'deeper' into DATA
phase, so that Bayes can 'sniff' it to broaden its dataset, also adds workload
when it could have been rejected earlier.
After extensive tests, including saving folders full of known-spam for training,
we've given it up as too marginal to be useful (ditto greylisting), and have
now had Spam-Bayes switched OFF for many years.
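
For anyone who wants to pursue the script route from the original question
anyway (pipe the whole message to a script from a transport and de-MIME it
there), it need not be much code. What follows is only a rough Python sketch,
not a tested Exim integration. The function names are made up for
illustration, and it assumes that sa-learn is on the PATH and reads a single
message from stdin when given --spam or --ham with no file argument. It also
assumes the forwarded mail arrives as a plain, un-encoded "message/rfc822"
part.

```python
import subprocess
from email import message_from_bytes


def extract_rfc822_parts(raw: bytes) -> list:
    """Return every top-level message/rfc822 attachment as raw bytes.

    The default (compat32) parser policy is used so that the headers of
    the attached message are written back out as they were received.
    """
    found = []

    def collect(part):
        if part.get_content_type() == "message/rfc822":
            payload = part.get_payload()
            # The parser stores the attached message as a one-item list.
            if isinstance(payload, list) and payload:
                found.append(payload[0].as_bytes())
            return  # do not descend into the forwarded message itself
        if part.is_multipart():
            for sub in part.get_payload():
                collect(sub)

    collect(message_from_bytes(raw))
    return found


def learn(parts, mode, runner=subprocess.run):
    """Feed each extracted message to sa-learn; mode is 'spam' or 'ham'.

    'runner' is a hypothetical hook so the call can be replaced in tests.
    """
    if mode not in ("spam", "ham"):
        raise ValueError("mode must be 'spam' or 'ham'")
    for part in parts:
        # Assumption: sa-learn reads the message from stdin here.
        runner(["sa-learn", "--" + mode], input=part, check=True)
```

A pipe transport for sa-learn-spam@domain would hand the message on stdin to a
small wrapper that calls extract_rfc822_parts(sys.stdin.buffer.read()) and
then learn(parts, "spam"). Which user that runs as, and hence which Bayes
database gets trained, is a separate decision.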