[Spambayes] overtraining and retraining

skip at pobox.com
Sun Oct 16 18:45:29 CEST 2011


    >> 2. When I train over a message, I keep training in a loop until the
    >> message probability goes under 20% (ham) or over 90% (spam). As the
    >> database ages, training spam needs more "looping", that is, the
    >> probability goes up slowly. The ham training, nevertheless, is fast
    >> and the loop counting is low.

    Jesus> Uhm, the wiki says: "never train the same message
    Jesus> twice". Reason? I am breaking this badly.

Jesus,

I use train to exhaustion, as referenced in your other email (contrib/tte.py
in the SpamBayes distribution).  My training database currently holds 21
hams and 17 spams.  I suggest you toss out everything but the most recent
10-15 hams and spams, then start over with that.
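
For anyone who hasn't looked at it, the general idea behind train to
exhaustion can be sketched roughly as below.  This is only an illustration
and not the code in contrib/tte.py: ToyClassifier, learn, score and
train_to_exhaustion are hypothetical names, the scoring is a naive smoothed
token-odds product rather than the SpamBayes math, and the 0.2 and 0.9
cutoffs are simply borrowed from the thresholds quoted above.

```python
import math
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real classifier; NOT the SpamBayes API."""

    def __init__(self):
        # Per-class token counts and number of trained messages,
        # keyed by is_spam (True for spam, False for ham).
        self.counts = {True: Counter(), False: Counter()}
        self.trained = {True: 0, False: 0}

    def learn(self, tokens, is_spam):
        # Count each distinct token once per trained message.
        self.counts[is_spam].update(set(tokens))
        self.trained[is_spam] += 1

    def score(self, tokens):
        """Return a spam probability in [0, 1]; 0.5 when nothing is known."""
        log_odds = 0.0
        for tok in set(tokens):
            # Add-one smoothing so unseen tokens never give a zero ratio.
            p_spam = (self.counts[True][tok] + 1) / (self.trained[True] + 2)
            p_ham = (self.counts[False][tok] + 1) / (self.trained[False] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


def train_to_exhaustion(clf, hams, spams,
                        ham_cutoff=0.2, spam_cutoff=0.9, max_rounds=10):
    """Train only on messages that score on the wrong side of a cutoff.

    Repeats full passes until one pass needs no training, or until
    max_rounds is reached.  Returns the number of passes made.
    """
    for round_no in range(1, max_rounds + 1):
        misses = 0
        for tokens in hams:
            if clf.score(tokens) > ham_cutoff:
                clf.learn(tokens, False)
                misses += 1
        for tokens in spams:
            if clf.score(tokens) < spam_cutoff:
                clf.learn(tokens, True)
                misses += 1
        if misses == 0:
            return round_no
    return max_rounds
```

The point of the loop is that a message which already scores correctly is
left alone, so the database only grows by the messages the classifier
actually gets wrong.  The max_rounds cap is there because, with a small or
lopsided set of messages, some scores may never cross the cutoff.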

I cheat as well, since both my pobox.com mail forwarding service and Gmail
(where it forwards to) apply their own spam filters before SpamBayes gets a
crack at my mail.  The downside of that is that I need to scan their held
spams periodically.

Skip

