[exim] Bayes - how bad is a small ham corpus with a big spam corpus?

Author: srunschke
Date:
To: exim-users
Subject: [exim] Bayes - how bad is a small ham corpus with a big spam corpus?

Hi list,

I'm currently trying to build up a new Bayes DB here, since the auto-built
DB got fubared (as expected, no need to throw things at me ;)). Building up
the spam part is rather easy, as we are getting more than enough of it, but
building up the ham part poses a problem.
Much of our mail from affiliated companies or customers arrives directly
via Lotus Notes replication, so there is nothing to feed there. Much of the
inbound SMTP mail contains private or confidential information, so I cannot
use it: I keep the source of the Bayes messages in a Notes DB server-side,
and I'd run into privacy issues.

So much for where I'm coming from, but now the question is:

Will a small ham corpus - let's say we take the minimum of 200 for the
beginning - compared to a fast-growing spam corpus (currently at around
2000 spam) be a problem and possibly lead to false Bayes scoring?
I'm fairly certain that this possibility exists, of course - it's natural -
but the question is how badly it could influence the scoring and how high
the probability is (approximately).
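
To make the question concrete, here is a minimal sketch of a Robinson-style
per-token estimate, which may or may not match what the filter in use
actually computes. The function name, the prior of 0.5, the strength of 1.0
and the token counts are all assumptions chosen for illustration; only the
corpus sizes (200 ham, 2000 spam) come from the numbers above.

```python
def token_spam_prob(spam_hits, ham_hits, n_spam, n_ham, prior=0.5, strength=1.0):
    """Smoothed estimate that a token indicates spam (Robinson-style sketch).

    spam_hits / ham_hits: number of spam / ham messages containing the token.
    n_spam / n_ham: total messages in each corpus.
    prior, strength: assumed smoothing parameters, not taken from any real config.
    """
    # Normalise by corpus size so a 10:1 corpus imbalance does not bias the ratio.
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    if spam_freq + ham_freq == 0:
        return prior  # token never seen: fall back to the prior
    raw = spam_freq / (spam_freq + ham_freq)
    n = spam_hits + ham_hits
    # Pull the raw ratio towards the prior when there is little evidence.
    return (strength * prior + n * raw) / (strength + n)


N_SPAM, N_HAM = 2000, 200  # corpus sizes from the question

# Token present in 1% of both corpora: the imbalance alone does not skew it.
balanced = token_spam_prob(20, 2, N_SPAM, N_HAM)

# Token present in 1% of spam but, by chance, absent from the small ham corpus.
unseen_in_ham = token_spam_prob(20, 0, N_SPAM, N_HAM)

print(f"balanced token:      {balanced:.3f}")
print(f"unseen-in-ham token: {unseen_in_ham:.3f}")
```

Under these assumptions the size ratio itself is normalised away. The risk
is rather that a small ham corpus misses many legitimate tokens, so tokens
that simply never made it into the 200 ham messages can look strongly spammy.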

Any insights on this would be most appreciated.

regards
        sash

--------------------------------------------------
Sascha Runschke
Network Administration
IT-Services

ABIT AG
Robert-Bosch-Str. 1
40668 Meerbusch

Tel.:+49 (0) 2150.9153.226
Mobil:+49 (0) 173.5419665
mailto:SRunschke@abit.de

http://www.abit.net
http://www.abit-epos.net
---------------------------------
Sicherheitshinweis zur E-Mail Kommunikation /
Security note regarding email communication:
http://www.abit.net/sicherheitshinweis.html
