[Spambayes] overtraining and retraining
Jesus Cea
jcea at jcea.es
Sun Oct 16 17:30:36 CEST 2011
On 16/10/11 17:09, Jesus Cea wrote:
> After a while the detection rate slowly gets worse, and training
> gets slower (the probability moves far less after each training
> cycle).
One point stressed frequently is that the numbers of ham and spam
messages trained should be similar. In my case they are not. My
current database counts are:
HAM: 150
SPAM: 27919
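
To illustrate why a skew of roughly 186 to 1 can matter, here is a
minimal sketch of a Robinson-style per-token estimate. It assumes the
token score is built from per-class ratios (count divided by the number
of messages trained in that class) and then smoothed toward 0.5. The
function name `token_spamprob` and the defaults `s=0.45` and `x=0.5`
are illustrative assumptions, not values read from this database or
confirmed against the SpamBayes source.

```python
def token_spamprob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Smoothed per-token spam estimate (illustrative, hypothetical helper).

    spam_count / ham_count: messages of each class containing the token.
    nspam / nham: total messages trained in each class.
    s, x: assumed prior strength and prior probability for unseen tokens.
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # never-seen token: fall back to the prior
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    p = spam_ratio / (spam_ratio + ham_ratio)
    # Pull the raw estimate toward x; the pull weakens as n grows.
    return (s * x + n * p) / (s + n)


NHAM, NSPAM = 150, 27919  # counts quoted in the message

# A token seen in 100 spam messages and 1 ham message.
skewed = token_spamprob(100, 1, NSPAM, NHAM)
balanced = token_spamprob(100, 1, 150, 150)

print(f"spam:ham ratio  = {NSPAM / NHAM:.1f}")
print(f"skewed counts   = {skewed:.3f}")
print(f"balanced counts = {balanced:.3f}")
```

Under these assumptions the token scores about 0.35 with the skewed
counts and about 0.99 with balanced ones: a single ham occurrence
outweighs 100 spam occurrences because each spam message is a much
smaller fraction of its class.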
The counts are so unbalanced because:
1. I only train on misclassifications and "unsures". Misclassifications
are rare (thanks!) and >99% of "unsures" are spam.
2. When I train on a message, I keep training in a loop until the
message's spam probability drops below 20% (ham) or rises above 90%
(spam). As the database ages, training spam needs more iterations
because the probability rises slowly. Ham training, by contrast, is
fast and needs few iterations.
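
The loop described in point 2 can be sketched as follows. This is an
illustration under stated assumptions, not the actual training script:
`ToyClassifier` is a hypothetical stand-in (not SpamBayes), the method
names `learn` and `spamprob` are chosen for the example, and
`max_rounds` is an added safety cap that the description above does not
mention.

```python
import math
from collections import Counter


class ToyClassifier:
    """Tiny token-count classifier for illustration only (not SpamBayes)."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def learn(self, tokens, is_spam):
        # Count each distinct token once per trained message.
        if is_spam:
            self.nspam += 1
            self.spam_counts.update(set(tokens))
        else:
            self.nham += 1
            self.ham_counts.update(set(tokens))

    def _token_prob(self, token, s=0.45, x=0.5):
        n = self.spam_counts[token] + self.ham_counts[token]
        if n == 0:
            return x
        spam_ratio = self.spam_counts[token] / max(self.nspam, 1)
        ham_ratio = self.ham_counts[token] / max(self.nham, 1)
        p = spam_ratio / (spam_ratio + ham_ratio)
        return (s * x + n * p) / (s + n)

    def spamprob(self, tokens):
        # Combine per-token estimates as summed log-odds.
        log_odds = 0.0
        for token in set(tokens):
            f = self._token_prob(token)
            log_odds += math.log(f) - math.log(1.0 - f)
        return 1.0 / (1.0 + math.exp(-log_odds))


def train_until(classifier, tokens, is_spam,
                ham_cutoff=0.20, spam_cutoff=0.90, max_rounds=50):
    """Retrain on one message until its score crosses the cutoff.

    Returns (rounds_used, final_probability).
    """
    rounds = 0
    while rounds < max_rounds:
        prob = classifier.spamprob(tokens)
        if (is_spam and prob > spam_cutoff) or (not is_spam and prob < ham_cutoff):
            break
        classifier.learn(tokens, is_spam)
        rounds += 1
    return rounds, classifier.spamprob(tokens)


clf = ToyClassifier()
clf.learn(["meeting", "monday", "report"], is_spam=False)
clf.learn(["lunch", "monday"], is_spam=False)

rounds, final = train_until(clf, ["cheap", "pills", "monday"], is_spam=True)
print(f"rounds={rounds} final={final:.3f}")
```

Each pass through the loop adds another copy of the same message to the
class counts, so the number of rounds needed is itself a rough measure
of how strongly the existing database resists the new evidence.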
Suggestions?
--
Jesus Cea Avion
jcea at jcea.es - http://www.jcea.es/
jabber / xmpp:jcea at jabber.org
"Things are not so easy"
"My name is Dump, Core Dump"
"Love is placing your happiness in the happiness of another" - Leibniz