[Spambayes] Incremental Training

Kenny Pitt kennypitt at hotmail.com
Fri Sep 24 15:57:54 CEST 2004

Incremental training never completely ends, but the number of new messages
that need training will reduce drastically after a very short time. Most
people get 95+% accuracy after only a week or two of training on mistakes
and unsures. However, spammers are constantly advertising new scams or
modifying their message format to try to get around spam filters, so it is
impossible for any filter to be 100% accurate.

Experience has shown that as long as you train only on the mistakes and
unsures, your database size should remain reasonably small. You would likely
have to train on thousands of messages before there would be any noticeable
slow-down in the SpamBayes processing.
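
To make "train only on the mistakes and unsures" concrete, here is a minimal Python sketch of that decision. It is not SpamBayes code: the function name `needs_training` and the cutoff values 0.20 and 0.90 are assumptions chosen for illustration, so check your own configuration for the thresholds actually in use.

```python
def needs_training(score, is_spam, ham_cutoff=0.20, spam_cutoff=0.90):
    """Return True if a message is a mistake or an unsure.

    score      -- spam probability assigned by the filter, 0.0 to 1.0
    is_spam    -- the user's own judgement of the message
    ham_cutoff, spam_cutoff -- hypothetical thresholds for this sketch
    """
    if score >= spam_cutoff:
        classified = "spam"
    elif score <= ham_cutoff:
        classified = "ham"
    else:
        # Scores between the two cutoffs are "unsure": always train.
        return True
    actual = "spam" if is_spam else "ham"
    # Correctly classified messages are skipped; only mistakes are trained.
    return classified != actual
```

Under these assumptions, a spam message that scored 0.95 is skipped, while a good message that scored 0.95 (a false positive) or any message that scored 0.50 (an unsure) is trained. Skipping the correctly classified messages is what keeps the database small.
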
This leads into your last question. The thing you want to avoid is
bulk-training on large numbers of messages, particularly if you are training
only one type of message, such as all spam and few or no good messages.
First, it unnecessarily increases the size of your training database.
Second, it can cause you to have significantly more trained messages of one
type than you have of the other.

The theory behind the SpamBayes filter suggests that optimum performance is
achieved when the number of good messages trained and the number of spam
messages trained are about equal. Most people still see
excellent results if they have trained 5 or even 10 spam messages for every
good message. If your training gets more one-sided than that, there is a
good chance that your accuracy will start to decrease. But every user is
different and it seems that some people are still getting good results with
imbalances as high as 100 to 1 or more.
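
As a rough way to keep an eye on this, the Python sketch below computes the imbalance between the two trained counts and flags it once it passes 10 to 1. The function names and the `soft_limit` parameter are hypothetical and not part of SpamBayes. Treating the imbalance the same in both directions is also an assumption. The 10 to 1 limit comes from the rule of thumb above and is not a hard rule.

```python
def training_balance(num_spam, num_ham):
    """Return the imbalance as larger count / smaller count (>= 1.0)."""
    if num_spam <= 0 or num_ham <= 0:
        raise ValueError("need at least one trained message of each type")
    return max(num_spam, num_ham) / min(num_spam, num_ham)


def balance_advice(num_spam, num_ham, soft_limit=10.0):
    """Return "ok" up to soft_limit to 1, otherwise "lopsided".

    soft_limit is an illustrative threshold, not a SpamBayes setting.
    """
    ratio = training_balance(num_spam, num_ham)
    return "ok" if ratio <= soft_limit else "lopsided"
```

For example, 500 spam and 100 good messages give a ratio of 5.0 and count as "ok", while 5000 spam and 100 good messages give 50.0 and count as "lopsided". As noted above, some users still get good results well beyond this limit, so take the label as a hint to watch accuracy rather than a verdict.
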
Kenny Pitt


From: Winoto Janputra [mailto:winotoj at Dorfin.com] 
Sent: Friday, September 24, 2004 8:24 AM
To: Kenny Pitt
Subject: RE: [Spambayes] Incremental Training

Hi Kenny,
Thanks for your reply.
I have another question. I know it's different for everybody, but when should
I stop the incremental training? I'm afraid that if the database gets too big
it will slow down Outlook.
We use ORF at the server level, but he still gets around 10 spam messages
every day.
If you are using any of the incremental training methods above then there
should be no need to manually train on the entire contents of your spam
folder.  In fact, doing so could potentially reduce the effectiveness of the
SpamBayes filter (for mathematical reasons that I won't go into 
Which one reduces the effectiveness, incremental training or a rebuild?