[Spambayes] Does SpamBayes support automatic selective training?

gpr grp_eee at yahoo.com
Thu Jan 3 19:43:59 CET 2008


I have a large number of good and spam messages (a few thousand) collected
over a year and would like to use these for initial training. But I
understand that it is preferable to train with only a small subset of these
messages (maybe a thousand: 500 spam and 500 ham) to keep my training
database small, fast and effective.

My question is: do I need to manually pick out the latest thousand or so
messages from this large corpus and feed them to SpamBayes, or can SpamBayes
automatically (and in fact smartly) do this job for me when given the entire
set and a required corpus size?

If this feature is not available, would it not be a very useful feature to
support? Here is why I think manual selection (just picking the latest 1000
messages from my large corpus, for a corpus size of 1000) may not be very
effective:

Not all the messages in the corpus need to be trained on (when using the
train on errors + unsures strategy). For example, if the last hundred good
messages I received are all of the same type (e.g. a long running thread
about a specific topic), then SpamBayes can easily classify any future
message of this type after training on just a small part of these messages.
So to get to a corpus size of 1000 messages (and to train SpamBayes over a
wide coverage of spam and ham message types), I may need to repeat the
training multiple times with different subsets until I achieve an effective
corpus.
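
A minimal sketch of the kind of selection described above, to make the idea concrete. It is not SpamBayes code: `TinyBayes` and `select_training_set` are hypothetical names, the scoring is a simplified stand-in for the real classifier, and the cutoffs of 0.2 and 0.9 are assumed thresholds for "sure ham" and "sure spam". Messages are walked newest first, and a message is trained on only if the classifier currently gets it wrong or is unsure about it.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy token-count classifier (hypothetical stand-in, not SpamBayes)."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}  # keyed by is_spam
        self.totals = {True: 0, False: 0}

    def learn(self, tokens, is_spam):
        self.counts[is_spam].update(set(tokens))
        self.totals[is_spam] += 1

    def score(self, tokens):
        """Return a spam score in [0, 1]; 0.5 means no evidence either way."""
        log_odds = 0.0
        for tok in set(tokens):
            n_spam = self.counts[True][tok]
            n_ham = self.counts[False][tok]
            n = n_spam + n_ham
            if n == 0:
                continue  # unseen token contributes nothing
            spam_ratio = n_spam / max(self.totals[True], 1)
            ham_ratio = n_ham / max(self.totals[False], 1)
            p = spam_ratio / (spam_ratio + ham_ratio)
            # Shrink toward 0.5 so rarely seen tokens count for less.
            f = (0.45 * 0.5 + n * p) / (0.45 + n)
            log_odds += math.log(f / (1 - f))
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid exp() overflow
        return 1 / (1 + math.exp(-log_odds))


def select_training_set(messages, max_size, ham_cutoff=0.2, spam_cutoff=0.9):
    """Train only on errors and unsures until max_size messages are picked.

    messages: iterable of (tokens, is_spam) pairs, newest first.
    Returns the trained classifier and the list of messages trained on.
    """
    clf = TinyBayes()
    selected = []
    for tokens, is_spam in messages:
        if len(selected) >= max_size:
            break
        s = clf.score(tokens)
        sure_and_right = (s >= spam_cutoff) if is_spam else (s <= ham_cutoff)
        if not sure_and_right:
            clf.learn(tokens, is_spam)
            selected.append((tokens, is_spam))
    return clf, selected


# Five near-identical thread messages, five near-identical spams, one odd ham.
thread = [(["python", "list", "thread"], False)] * 5
spam = [(["cheap", "pills", "buy"], True)] * 5
other = [(["invoice", "meeting"], False)]
clf, selected = select_training_set(thread + spam + other, max_size=1000)
print(len(selected))  # 3: one thread message, one spam, and the odd ham
```

In this toy run the repeated thread messages after the first are skipped because they already score as sure ham, which is the behaviour I would like to get from the whole corpus without picking messages by hand.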

I hope I have explained my question clearly. Pardon any ignorance on my part.

Thanks in advance for any clarification.


