[Spambayes] A Couple of Training Questions

skip at pobox.com
Tue May 8 17:03:14 CEST 2007


    Dave> Q1:

    Dave> I have a cron job that runs sb_imapfilter.py to train periodically
    Dave> from my ham/spam corpus folders.

    Dave> AFAICT, that will train only as-yet-untrained messages.  I know
    Dave> there's supposed to be something about keeping ham and spam
    Dave> balanced. If I start out with 1000 messages in each folder, then
    Dave> dump 10 into just the ham folder, the next training run will train
    Dave> 10 hams and no spams.  Is that very bad for future performance, or
    Dave> is that temporary imbalance strongly mitigated by the overall size
    Dave> of the two folders?

Balance is overall, not per training run.  1010:1000 would be almost
perfectly balanced.  I think the consensus is also to train only on unsures
and mistakes.  That might reduce the size of your training database
significantly.
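
To put a number on "almost perfectly balanced", here is a minimal sketch
that computes the overall ham:spam ratio before and after the extra ten
hams.  The function name balance_ratio is made up for illustration; it is
not part of SpamBayes.

```python
def balance_ratio(ham_count, spam_count):
    """Ratio of the larger class to the smaller one; 1.0 is perfect balance.

    Hypothetical helper for illustration only, not a SpamBayes function.
    """
    smaller = min(ham_count, spam_count)
    if smaller == 0:
        raise ValueError("need at least one message of each kind")
    return max(ham_count, spam_count) / smaller


# Totals across the whole training database, not a single training run.
before = balance_ratio(1000, 1000)
after = balance_ratio(1010, 1000)
print(before, after)
```

The run that trains 10 hams and 0 spams looks lopsided in isolation, but the
totals move only from 1000:1000 to 1010:1000, a one percent shift.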

    Dave> Q2:

    Dave> I notice that the incremental training of sb_imapfilter trains all
    Dave> (as-yet-untrained) hams, then all (as-yet-untrained) spams.
    Dave> However, Skip's train-to-exhaustion script tries to interleave
    Dave> training of Hams and Spams.  Is that interleaving only important
    Dave> for train-to-exhaustion, or should all methods use it?

Different goals.  Yes, the interleaving is only important to the
train-to-exhaustion scheme.  Suppose I train six messages like this:

    H S H S H S

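If I next train a ham in sb_imapfilter it will just update the database with
the new info, regardless of the order the messages arrive in.  The tte script
scores each message first, and only uses it to supplement the training
database if it is scored incorrectly.  Since each score depends on what has
already been trained, the order matters there.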
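
The difference between the two approaches can be sketched with a toy
classifier.  Everything here is an assumption for illustration: ToyClassifier
is a crude token counter, not the SpamBayes classifier; the 0.2 and 0.9
cutoffs are example values; and train_on_error makes a single pass, whereas
a train-to-exhaustion script would repeat passes until nothing more is
misclassified.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a token-count classifier (not SpamBayes)."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.trained = 0  # number of messages actually added to the database

    def train(self, text, label):
        self.counts[label].update(text.lower().split())
        self.trained += 1

    def score(self, text):
        """Crude spam probability in [0, 1]; 0.5 means no evidence."""
        tokens = text.lower().split()
        spam = sum(self.counts["spam"][t] for t in tokens)
        ham = sum(self.counts["ham"][t] for t in tokens)
        total = spam + ham
        return 0.5 if total == 0 else spam / total


def train_everything(clf, messages):
    """Train on every message, in whatever order it arrives."""
    for text, label in messages:
        clf.train(text, label)


def train_on_error(clf, messages, ham_cutoff=0.2, spam_cutoff=0.9):
    """Score first; train only on messages that are unsure or wrong."""
    for text, label in messages:
        s = clf.score(text)
        if label == "spam":
            correct = s >= spam_cutoff
        else:
            correct = s <= ham_cutoff
        if not correct:
            clf.train(text, label)


# Six messages interleaved H S H S H S.
messages = [
    ("lunch meeting tomorrow", "ham"),
    ("cheap pills online", "spam"),
    ("meeting notes attached", "ham"),
    ("cheap pills now", "spam"),
    ("lunch tomorrow", "ham"),
    ("pills online cheap", "spam"),
]

exhaustive = ToyClassifier()
train_everything(exhaustive, messages)

selective = ToyClassifier()
train_on_error(selective, messages)

print(exhaustive.trained, selective.trained)
```

With this toy data the first ham and first spam are trained because nothing
is known yet, and the remaining four already score correctly, so the
score-first pass adds far fewer messages to the database.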

Skip


