[Spambayes] Re: suggestions for training and filtering?
jacob-spambayes-list at statisticalanomaly.com
Wed Dec 3 13:54:47 EST 2003
I started out with about 300 each of ham and spam. I would always train
on ham and unsures, and I would delete the spam. However, as the ham
count in my database grew, I would classify some additional spam messages
to keep the ratio even. When I did that, I tried to train on a block of
about 100 messages (~3 days' worth for me) at a time, so that I had a
diverse enough sample to avoid skewing my results.
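A rough sketch of that balancing step, in case it helps. This is not
SpamBayes code: the function names are made up for illustration, and the
rule of waiting until a full block is both needed and available is an
assumption about how one might automate what was done by hand.

```python
def spam_deficit(ham_count, spam_count):
    """How many more spam messages are needed to reach a 1:1 ratio."""
    return max(0, ham_count - spam_count)


def next_spam_block(saved_spam, ham_count, spam_count, block_size=100):
    """Return a block of saved spam to train on, or [] if it is too early.

    Hypothetical helper: a block is released only when the database is
    short at least block_size spam AND that many have been saved, so each
    training pass sees a reasonably diverse sample.
    """
    deficit = spam_deficit(ham_count, spam_count)
    if deficit < block_size or len(saved_spam) < block_size:
        return []
    return saved_spam[:block_size]
```

With 450 ham and 300 spam trained, and 250 spam saved, this hands back the
first 100 saved messages; with only 350 ham trained it hands back nothing
yet.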
Once I got to the point where most of my messages were being properly
sorted, I just started deleting the spam. To be honest, I still train
my unsures, but I get very, very few of them.
In addition, if I notice the number of unsures (or even spam being
marked as ham) going up, I'll start saving new spam, and when I have
enough to be at about a 1:1 ratio with my saved ham, I'll nuke the
database and retrain it using the mail I've collected recently.
This system has worked out really well for me so far.
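For anyone who wants to turn the "nuke and retrain" decision into a check,
here is a minimal sketch. It does not use any SpamBayes API; the function
names and both thresholds (5% problem rate, 10% tolerance around 1:1) are
assumptions picked for illustration, not numbers from my own setup.

```python
def problem_rate(unsure, missed_spam, total):
    """Fraction of recent mail that was unsure or spam scored as ham."""
    if total == 0:
        return 0.0
    return (unsure + missed_spam) / total


def should_rebuild(unsure, missed_spam, total, saved_spam, saved_ham,
                   max_problem_rate=0.05, ratio_tolerance=0.1):
    """Decide whether to wipe the database and retrain on saved mail.

    Rebuild only when (a) the classifier is visibly slipping and
    (b) the saved spam and ham are close to a 1:1 ratio.
    Both thresholds are hypothetical defaults.
    """
    if problem_rate(unsure, missed_spam, total) <= max_problem_rate:
        return False
    if saved_ham == 0:
        return False
    ratio = saved_spam / saved_ham
    return abs(ratio - 1.0) <= ratio_tolerance
```

If the check fails only because the ratio is off, that just means keep
saving spam a while longer before retraining.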
Seth Goodman wrote:
> He says he isn't training at all anymore. My question for Jacob is what was
> the initial size of his training set and what were his criteria for training
> before he reached his present state?