[Spambayes] Does SpamBayes support automatic selective training?

Jesse Pelton jsp at PKC.com
Thu Jan 3 20:13:46 CET 2008


Do you have reason to believe that incremental training on messages that
you're currently receiving would be ineffective?  I retrain from scratch
periodically, and I generally find that a remarkably small corpus (maybe
a total of a couple of dozen messages trained) is effective.  I retrain in
part because I suspect that the content of spam that I receive changes
over time, so training performed on messages from the distant past (say,
six months ago) may be irrelevant or worse for my current message
stream.

One of the counter-intuitive things about SpamBayes is how little data
it needs to go on.  This makes retraining fast, easy, and (for me, at
least) perversely rewarding.
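
For what it's worth, the selection described in the question below
(train only on messages the classifier gets wrong or is unsure about,
and stop at a target corpus size) is easy to sketch.  The following is a
toy illustration in Python, not SpamBayes code: ToyClassifier,
select_training_set, and the max_size parameter are names made up for
this example, the scoring is a crude word-count stand-in for the real
classifier, and the 0.2/0.9 cutoffs are just illustrative values.

```python
import math
from collections import Counter


class ToyClassifier:
    """Tiny word-count classifier; a stand-in, NOT the SpamBayes classifier."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each distinct word once per message.
        self.counts[label].update(set(text.lower().split()))
        self.totals[label] += 1

    def score(self, text):
        """Return a spam score in [0, 1]; 0.5 means no evidence either way."""
        if not (self.totals["spam"] and self.totals["ham"]):
            return 0.5
        log_odds = 0.0
        for word in set(text.lower().split()):
            if word not in self.counts["spam"] and word not in self.counts["ham"]:
                continue  # ignore words never seen in training
            p_spam = (self.counts["spam"][word] + 0.5) / (self.totals["spam"] + 1)
            p_ham = (self.counts["ham"][word] + 0.5) / (self.totals["ham"] + 1)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in range
        return 1 / (1 + math.exp(-log_odds))


def select_training_set(messages, max_size=None, ham_cutoff=0.2, spam_cutoff=0.9):
    """Train only on mistakes and unsures.

    messages: iterable of (text, label) pairs, label being "spam" or "ham".
    Returns the trained classifier and the list of messages actually used.
    """
    clf = ToyClassifier()
    trained = []
    for text, label in messages:
        if max_size is not None and len(trained) >= max_size:
            break  # hypothetical corpus-size cap
        s = clf.score(text)
        if s >= spam_cutoff:
            verdict = "spam"
        elif s <= ham_cutoff:
            verdict = "ham"
        else:
            verdict = "unsure"
        if verdict != label:  # wrong or unsure: worth training on
            clf.train(text, label)
            trained.append((text, label))
    return clf, trained
```

Feed the messages newest first if you want recent mail to dominate.  In
this toy version, a long thread on one topic tends to contribute only
its first message or two, because the later ones are already classified
correctly and get skipped.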

-----Original Message-----
From: spambayes-bounces at python.org [mailto:spambayes-bounces at python.org]
On Behalf Of gpr
Sent: Thursday, January 03, 2008 1:44 PM
To: spambayes at python.org
Subject: [Spambayes] Does SpamBayes support automatic selective training?


Hi,

I have a large number of good and spam messages (a few thousand)
collected over a year and would like to use these for initial training.
But I know that it is preferable to train with only a small subset of
these messages (maybe a thousand: 500 spam and 500 ham) to keep my
training db minimal, fast, and effective.

My question is: do I need to manually pick out the latest thousand or so
messages from this large corpus and feed them to SpamBayes, or can
SpamBayes do this job for me automatically (and smartly) when given the
entire set and a required corpus size?

If this feature is not available, would it not be a hell of a useful
feature to support? Here is why I think manual selection (just picking
the latest 1000 messages from my large corpus, for a corpus size of
1000) may not be very effective:

Not all the messages from the corpus may need to be trained (using the
train-on-errors-and-unsures strategy). For example, if the last hundred
good messages I received are of the same type (e.g. a long-running
thread about a specific topic), then SpamBayes can easily classify any
future message of this type after training on just a small part of
them. So to reach a corpus size of 1000 messages (and to train
SpamBayes over a wide coverage of spam and ham message types), I may
need to repeat the training multiple times with different subsets until
I achieve an effective corpus.

I hope I have explained my query clearly. Pardon any ignorance on my part.

Thanks for clarifications in advance.

Ram

