[Spambayes] Does SpamBayes support automatic selective training?

Jesse Pelton jsp at PKC.com
Thu Jan 3 20:13:46 CET 2008

Do you have reason to believe that incremental training on messages that
you're currently receiving would be ineffective?  I retrain from scratch
periodically, and I generally find that a remarkably small corpus (maybe
a total of a couple of dozen messages trained) is effective.  I retrain in
part because I suspect that the content of spam that I receive changes
over time, so training performed on messages from the distant past (say,
six months ago) may be irrelevant or worse for my current message stream.

One of the counter-intuitive things about SpamBayes is how little data
it needs to go on.  This makes retraining fast, easy, and (for me, at
least) perversely rewarding.

-----Original Message-----
From: spambayes-bounces at python.org [mailto:spambayes-bounces at python.org]
On Behalf Of gpr
Sent: Thursday, January 03, 2008 1:44 PM
To: spambayes at python.org
Subject: [Spambayes] Does SpamBayes support automatic selective training?


I have a large number of good and spam messages (a few thousand) collected
over a year and would like to use these for initial training. But I know
that it is preferable to train with only a small subset of these messages
(maybe a thousand: 500 spam and 500 ham) to keep my training db minimal,
fast and effective.

My query is: do I need to manually pick out the latest thousand or so
messages from this large corpus and feed them to SpamBayes, or can SpamBayes
do this job for me (ideally in a smart way) when given the entire set and a
required corpus size?

If this feature is not available, would it not be a very useful one to
support? Here is why I think manual selection (just picking the latest 1000
messages, for a corpus size of 1000, from my large corpus) may not be very
effective:

Not all the messages in the corpus may need to be trained (using the
train on errors + unsures strategy). For example, if the last hundred good
messages received are of the same type (e.g., a long-running thread about a
specific topic), then SpamBayes can easily classify any future message of
this type after training on just a small part of them. So to get to a
corpus size of 1000 messages (and to train SpamBayes over a wide coverage of
spam and ham message types), I may need to repeat the training multiple
times with different subsets until I achieve an effective corpus.
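
The selection described above can be sketched roughly as follows. This is
only an illustration of training on errors and unsures up to a size limit,
not SpamBayes code: ToyClassifier is a hypothetical stand-in (a very small
naive Bayes scorer), the cutoffs 0.2 and 0.9 are assumed values, and
select_training_set and its max_size parameter are names invented for this
example. Messages are assumed to arrive newest first as (text, is_spam)
pairs.

```python
import math
from collections import Counter

HAM_CUTOFF = 0.2   # assumed: scores at or below this count as ham
SPAM_CUTOFF = 0.9  # assumed: scores at or above this count as spam


class ToyClassifier:
    """Hypothetical stand-in for a Bayesian filter; NOT the SpamBayes API."""

    def __init__(self):
        self.spam_words = Counter()
        self.ham_words = Counter()
        self.nspam = 0
        self.nham = 0

    def learn(self, text, is_spam):
        # Count each distinct word once per message.
        words = set(text.lower().split())
        if is_spam:
            self.spam_words.update(words)
            self.nspam += 1
        else:
            self.ham_words.update(words)
            self.nham += 1

    def score(self, text):
        # Naive Bayes log-odds with add-one smoothing and equal priors.
        log_odds = 0.0
        for w in set(text.lower().split()):
            p_spam = (self.spam_words[w] + 1) / (self.nspam + 2)
            p_ham = (self.ham_words[w] + 1) / (self.nham + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so math.exp cannot overflow on long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


def select_training_set(messages, max_size):
    """Train only on messages the classifier gets wrong or is unsure about.

    Stops once max_size messages have been trained.
    """
    clf = ToyClassifier()
    trained = []
    for text, is_spam in messages:
        if len(trained) >= max_size:
            break
        p = clf.score(text)
        confident_and_right = p >= SPAM_CUTOFF if is_spam else p <= HAM_CUTOFF
        if not confident_and_right:
            clf.learn(text, is_spam)
            trained.append((text, is_spam))
    return clf, trained
```

With a selection loop of this kind, a long thread of near-identical messages
tends to contribute only its first few messages to the training set, because
later ones are already classified confidently and are skipped.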

I hope I have explained my query clearly. Pardon me for any ignorance.

Thanks in advance for any clarifications.



SpamBayes at python.org
Info/Unsubscribe: http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html