[Spambayes] Does SpamBayes support automatic selective training?

Thu Jan 10 01:05:00 CET 2008

on Thu Jan 03 2008, gpr <grp_eee-AT-yahoo.com> wrote:

> Hi,
>
> I have a large number of good and spam messages (a few thousand) collected
> over a year and would like to use these for initial training. But I know
> that it is preferable to train with only a small subset of these messages
> (maybe a thousand: 500 spam and 500 ham) to keep my training db minimal,
> fast and effective.
>
> My question is: do I need to manually pick out the latest thousand or so
> messages from this large corpus and feed them to SpamBayes, or can SpamBayes
> automatically (in fact, smartly) do this job for me when given the entire
> set and a required corpus size?
>
> If this feature is not available, would it not be a hell of a useful feature
> to support? Here is why I think manual selection (just picking the latest
> 1000 messages, for a corpus size of 1000, from my large corpus) may not be
> very effective:
>
> Not all the messages in the corpus may need to be trained (using the
> train-on-errors-and-unsures strategy). For example, if the last hundred good
> messages I received are of the same type (e.g. a long-running thread about a
> specific topic), then SpamBayes can easily classify any future message of
> this type after training on just a small part of these messages. So to get
> to a corpus size of 1000 messages (and to train SpamBayes over a wide
> coverage of spam and ham message types), I may need to repeat the training
> multiple times with different subsets until I achieve an effective corpus.

I use the train-to-exhaustion script, contrib/tte.py, whose "prune"
option can effectively remove the messages that don't make any
difference from your training set.
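
Roughly, the idea of training to exhaustion and then pruning can be
sketched as below. This is only a toy illustration under my own
assumptions, not the code in contrib/tte.py: ToyClassifier, the 0.4/0.6
cutoffs, max_rounds and the sample corpus are all made up for the
example, and the scoring is far cruder than what SpamBayes does.

```python
from collections import Counter


class ToyClassifier:
    """Tiny token-frequency scorer; a hypothetical stand-in, NOT SpamBayes."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        # Count each distinct token once per trained message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Mean per-token spam probability with add-one smoothing.
        # A token never seen in training contributes 0.5 (no evidence).
        probs = [
            (self.spam[t] + 1) / (self.spam[t] + self.ham[t] + 2)
            for t in set(tokens)
        ]
        return sum(probs) / len(probs) if probs else 0.5


def train_to_exhaustion(messages, ham_cutoff=0.4, spam_cutoff=0.6, max_rounds=10):
    """Train only on errors/unsures, round after round, until none remain.

    Returns the classifier and the pruned corpus: the messages that were
    actually trained on in some round. Everything else made no difference.
    """
    clf = ToyClassifier()
    used = set()
    for _ in range(max_rounds):
        misses = 0
        for i, (text, is_spam) in enumerate(messages):
            tokens = text.lower().split()
            s = clf.score(tokens)
            correct = s >= spam_cutoff if is_spam else s <= ham_cutoff
            if not correct:  # wrong or unsure: train on this message
                clf.train(tokens, is_spam)
                used.add(i)
                misses += 1
        if misses == 0:  # a full pass with nothing to learn: stop
            break
    pruned = [messages[i] for i in sorted(used)]
    return clf, pruned


# Made-up corpus of (text, is_spam) pairs with near-duplicate messages.
corpus = [
    ("meeting agenda for monday", False),
    ("meeting agenda for tuesday", False),
    ("meeting agenda for friday", False),
    ("cheap pills buy now", True),
    ("cheap pills buy today", True),
]
clf, pruned = train_to_exhaustion(corpus)
print(len(corpus), "messages in,", len(pruned), "kept after pruning")
```

With near-duplicate messages like the long thread described above, only
the first message of each kind ends up in the pruned set, because the
rest already score correctly by the time the loop reaches them.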

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com