[Spambayes] A Couple of Training Questions
skip at pobox.com
Tue May 8 17:03:14 CEST 2007
Dave> Q1:
Dave> I have a cron job that runs sb_imapfilter.py to train periodically
Dave> from my ham/spam corpus folders.
Dave> AFAICT, that will train only as-yet-untrained messages. I know
Dave> there's supposed to be something about keeping ham and spam
Dave> balanced. If I start out with 1000 messages in each folder, then
Dave> dump 10 into just the ham folder, the next training run will train
Dave> 10 hams and no spams. Is that very bad for future performance, or
Dave> is that temporary imbalance strongly mitigated by the overall size
Dave> of the two folders?
Balance is overall, not per training run. 1010:1000 would be almost
perfectly balanced. I think the consensus is also to train only on unsures
and mistakes. That might reduce the size of your training database
significantly.
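
To put a number on "almost perfectly balanced", here is a minimal sketch
that computes the overall ham:spam ratio before and after the extra ten
hams. The function name ham_spam_ratio is made up for illustration and is
not part of SpamBayes.

```python
def ham_spam_ratio(nham, nspam):
    """Return larger count / smaller count; 1.0 means perfectly balanced.

    Hypothetical helper for illustration only, not a SpamBayes function.
    """
    smaller = min(nham, nspam)
    if smaller == 0:
        # One class has no training data at all: maximally imbalanced.
        return float("inf")
    return max(nham, nspam) / smaller


# Totals across all training so far, not the size of the latest run.
before = ham_spam_ratio(1000, 1000)
after = ham_spam_ratio(1010, 1000)
print("before:", before, "after:", after)
```

The latest run on its own is 10:0, but what counts is the cumulative
total, which moves only from 1000:1000 to 1010:1000.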
Dave> Q2:
Dave> I notice that the incremental training of sb_imapfilter trains all
Dave> (as-yet-untrained) hams, then all (as-yet-untrained) spams.
Dave> However, Skip's train-to-exhaustion script tries to interleave
Dave> training of Hams and Spams. Is that interleaving only important
Dave> for train-to-exhaustion, or should all methods use it?
Different goals. Yes, the interleaving is only important to the
train-to-exhaustion scheme. Suppose I train six messages like this:
H S H S H S
If I next train a ham with sb_imapfilter, it just updates the database with
the new info. The tte script, by contrast, scores each message first and
adds it to the training database only if it is scored incorrectly.
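
The difference between the two approaches can be sketched roughly as
follows. This is a toy illustration, not the actual tte script or the
SpamBayes classifier: ToyClassifier, interleave, train_all and
train_on_errors are all made-up names, the scoring rule is a crude
stand-in, and the 0.2/0.9 cutoffs are assumed values. It also makes only a
single pass, whereas train-to-exhaustion repeats until nothing is scored
incorrectly.

```python
from itertools import zip_longest


class ToyClassifier:
    """Crude token-count scorer; a stand-in, NOT the SpamBayes classifier."""

    def __init__(self):
        self.ham_counts = {}
        self.spam_counts = {}
        self.nham = 0
        self.nspam = 0

    def learn(self, tokens, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        for tok in set(tokens):
            counts[tok] = counts.get(tok, 0) + 1
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def score(self, tokens):
        """Return a spam score in [0, 1]; 0.5 when no token has been seen."""
        votes = []
        for tok in set(tokens):
            s = self.spam_counts.get(tok, 0)
            h = self.ham_counts.get(tok, 0)
            if s + h:
                votes.append(s / (s + h))
        return sum(votes) / len(votes) if votes else 0.5


def interleave(hams, spams):
    """Yield (message, is_spam) pairs in H S H S ... order."""
    for ham, spam in zip_longest(hams, spams):
        if ham is not None:
            yield ham, False
        if spam is not None:
            yield spam, True


def train_all(classifier, messages):
    """Incremental style: every message updates the database."""
    for tokens, is_spam in messages:
        classifier.learn(tokens, is_spam)


def train_on_errors(classifier, messages, ham_cutoff=0.2, spam_cutoff=0.9):
    """One tte-style pass: score first, learn only mistakes and unsures.

    The cutoffs are assumed values for this sketch.
    Returns the number of messages actually trained.
    """
    trained = 0
    for tokens, is_spam in messages:
        score = classifier.score(tokens)
        correct = score >= spam_cutoff if is_spam else score <= ham_cutoff
        if not correct:
            classifier.learn(tokens, is_spam)
            trained += 1
    return trained


hams = [["meeting", "lunch"], ["meeting", "report"], ["lunch", "report"]]
spams = [["viagra", "cheap"], ["cheap", "pills"], ["viagra", "pills"]]

full = ToyClassifier()
train_all(full, interleave(hams, spams))

selective = ToyClassifier()
trained = train_on_errors(selective, interleave(hams, spams))
print("train_all learned", full.nham + full.nspam, "messages")
print("train_on_errors learned", trained, "messages")
```

Because the selective pass decides whether to train based on what the
database already contains, the order of messages matters there, which is
why alternating hams and spams helps. When every message is trained
regardless of score, the database ends up with the same contents in any
order.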
Skip