[Spambayes] training problem?

Seth Goodman nobody at spamcop.net
Wed Dec 3 23:01:49 EST 2003

[Skip Montanaro]
> The idea is that you train on one or a few of your lowest scoring spams
> and/or highest scoring hams, save your unsure file, then run the above
> again.  Any previously "unsure" spams which now show up at the spam end of
> things get ignored.  Lather, rinse, repeat.  When you're tired of the
> cleansing cycle (or your hair is squeaky clean), rename your unsure folder,

OK, I've just done your process manually through the Outlook plug-in.  I
started with an initial training set of about 150 each of spam and ham (one
day of spam and a week of ham from about a month ago).  I then repeatedly
filtered a corpus of about 4,500 spam and 1,500 ham (the ham goes back much
further in time), added the highest scoring ham and lowest scoring spam to
the training set (~50 messages at a time), retrained, and filtered again,
repeating until brain dead.  I did this until all ham scored 0 and all but
three spam scored
at least 90.  The final training set was 525 ham and 548 spam.  Therefore, a
training set about 15% of the corpus size gave an a posteriori
classification accuracy of 100% with only 0.05% unsures.  Of course, the a
priori performance can't stay that good, but it is still impressive.  I have
set my thresholds at 90/5 and will continue to train on all errors and
unsures.  I'll keep statistics and see how it goes.
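
The figures above can be checked with a few lines of Python.  This is only a
rough sketch: the corpus sizes are the rounded "about 4,500" and "about
1,500" from the text, and it assumes the three spam scoring below 90 are the
only unsures.  Whether the training messages count as part of the corpus is
not stated, so both readings are computed.

```python
# Rounded corpus sizes quoted in the message (assumed exact here).
ham_corpus, spam_corpus = 1500, 4500
# Final training set sizes quoted in the message.
train_ham, train_spam = 525, 548
# Assumption: the three spam that scored below 90 are the only unsures.
unsure = 3

corpus = ham_corpus + spam_corpus
training = train_ham + train_spam

# Reading 1: training messages were drawn from the corpus.
frac_of_corpus = training / corpus
# Reading 2: training messages are counted in addition to the corpus.
frac_of_total = training / (corpus + training)
unsure_rate = unsure / corpus

print(f"training set size:        {training}")
print(f"fraction of corpus:       {frac_of_corpus:.1%}")
print(f"fraction of corpus+train: {frac_of_total:.1%}")
print(f"unsure rate:              {unsure_rate:.2%}")
```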

This does show, as you suggested, that a smaller subset of spam (and ham)
can supply the tokens to get very good classification, at least a
posteriori.  Let's see how this works as an a priori predictor.  I bet it
will work great.

Thanks for the training algorithm.  I think the key is to start with a small
initial training set and then continually add the "outliers".  Doing this
then turns some correctly classified messages into outliers, and you then
add them to the training set and recurse until you have a good a posteriori
classifier.  If I did this with fewer than 50 messages at a time, I probably
would have ended up with a smaller training set, but this was time-intensive
enough.  If this turns out to be a good a priori predictor of spam/ham, this
training method could be automated based on your scripts.
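
The loop described above might be automated along these lines.  This is a
minimal sketch, not Skip's scripts and not the SpamBayes classifier:
ToyClassifier is a hypothetical stand-in that scores messages from 0 to 100
using raw token counts, and train_on_outliers, seed and batch are names
invented for illustration.  The 90/5 cutoffs are the thresholds mentioned
earlier in this message.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in scorer (NOT SpamBayes): 0 = ham, 100 = spam."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, text, is_spam):
        # Count each distinct token once per message.
        (self.spam if is_spam else self.ham).update(set(text.lower().split()))

    def score(self, text):
        tokens = set(text.lower().split())
        s = sum(self.spam[t] for t in tokens)
        h = sum(self.ham[t] for t in tokens)
        if s + h == 0:
            return 50.0  # no evidence either way: unsure
        return 100.0 * s / (s + h)


def train_on_outliers(ham, spam, seed=150, batch=25,
                      ham_cutoff=5, spam_cutoff=90, max_rounds=1000):
    """Start from a small seed set, then keep training on the worst outliers."""
    clf = ToyClassifier()
    train_ham, train_spam = list(ham[:seed]), list(spam[:seed])
    pool_ham, pool_spam = list(ham[seed:]), list(spam[seed:])
    for msg in train_ham:
        clf.train(msg, False)
    for msg in train_spam:
        clf.train(msg, True)

    for _ in range(max_rounds):
        # Highest scoring ham that is not safely below the ham cutoff.
        bad_ham = sorted((m for m in pool_ham if clf.score(m) >= ham_cutoff),
                         key=clf.score, reverse=True)[:batch]
        # Lowest scoring spam that is not at or above the spam cutoff.
        bad_spam = sorted((m for m in pool_spam if clf.score(m) < spam_cutoff),
                          key=clf.score)[:batch]
        if not bad_ham and not bad_spam:
            break  # everything left in the pool is classified correctly
        for m in bad_ham:
            clf.train(m, False)
            train_ham.append(m)
            pool_ham.remove(m)
        for m in bad_spam:
            clf.train(m, True)
            train_spam.append(m)
            pool_spam.remove(m)
    return clf, train_ham, train_spam
```

Each round moves at least one message from the pool into the training set,
so the loop stops once no outliers remain or the pool is empty.  A smaller
batch should tend to give a smaller final training set at the cost of more
rounds.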

Thanks again.

Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above
