[Spambayes] overtraining and retraining
skip at pobox.com
Sun Oct 16 18:45:29 CEST 2011
>> 2. When I train over a message, I keep training in a loop until the
>> message probability goes under 20% (ham) or over 90% (spam). As the
>> database ages, training spam needs more "looping", that is, the
>> probability goes up slowly. The ham training, nevertheless, is fast
>> and the loop counting is low.
Jesus> Uhm, the wiki says: "never train the same message
Jesus> twice". Reason? I am breaking this badly.
Jesus,
I use train-to-exhaustion, as referenced in your other email (contrib/tte.py
in the SpamBayes distribution). My training database currently holds 21 hams
and 17 spams. I suggest you toss out everything but the most recent 10-15
hams and spams, then start over from that.
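
For what it's worth, here is a rough sketch of the train-to-exhaustion idea
as I understand it: cycle over the whole (small) set of hams and spams, train
only on the messages that still score on the wrong side of the cutoffs, and
stop once a full pass needs no training. This is not the code from
contrib/tte.py. ToyClassifier, train_to_exhaustion, and max_rounds are names
made up for illustration, the scorer is a simplified naive-Bayes stand-in
rather than the real SpamBayes classifier, and the 0.2/0.9 cutoffs are just
the 20%/90% figures from your message.

```python
import math
from collections import Counter


class ToyClassifier:
    """Tiny naive-Bayes-style scorer; a stand-in, NOT the SpamBayes classifier."""

    def __init__(self):
        # Per-class token counts and number of training events.
        self.counts = {True: Counter(), False: Counter()}
        self.trained = {True: 0, False: 0}

    def train(self, tokens, is_spam):
        # Count each distinct token once per training event.
        self.counts[is_spam].update(set(tokens))
        self.trained[is_spam] += 1

    def score(self, tokens):
        """Return a spam probability in [0, 1]; 0.5 means no evidence."""
        log_odds = 0.0
        for tok in set(tokens):
            # Add-one smoothing so unseen tokens never give a zero probability.
            p_spam = (self.counts[True][tok] + 1) / (self.trained[True] + 2)
            p_ham = (self.counts[False][tok] + 1) / (self.trained[False] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


def train_to_exhaustion(clf, hams, spams,
                        ham_cutoff=0.2, spam_cutoff=0.9, max_rounds=50):
    """Repeat passes over all messages, training only on the misses.

    Returns the number of passes made. max_rounds is a hypothetical
    safety limit so the loop always terminates.
    """
    labelled = [(h, False) for h in hams] + [(s, True) for s in spams]
    for round_no in range(1, max_rounds + 1):
        misses = 0
        for tokens, is_spam in labelled:
            s = clf.score(tokens)
            wrong = s < spam_cutoff if is_spam else s > ham_cutoff
            if wrong:
                clf.train(tokens, is_spam)
                misses += 1
        if misses == 0:  # a clean pass: everything scores properly
            return round_no
    return max_rounds


# Toy, hand-made token lists standing in for real messages.
hams = [["meeting", "lunch", "report"], ["lunch", "tomorrow", "report"]]
spams = [["viagra", "cheap", "offer"], ["cheap", "offer", "winner"]]

clf = ToyClassifier()
rounds = train_to_exhaustion(clf, hams, spams)
print("passes needed:", rounds)
```

The difference from looping on a single message until it crosses the
threshold is that here a message only gets retrained while it still scores
wrong relative to the whole set, and the set itself stays small.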
I cheat as well, since both my pobox.com mail forwarding service and Gmail
(where it forwards to) apply their own spam filters before SpamBayes gets a
crack at my mail. The downside is that I need to scan the spam they hold
periodically.
Skip