[Spambayes] out of balance database
tim.one at comcast.net
Thu Dec 11 12:08:08 EST 2003
> what types of bad things happen when the database is "out of balance"?
Increased error rates, and sluggish response to new training.
Suppose you trained on 1000 spam and no ham. Then every token in the
database looks purely spammy, and no token in the database looks hammy. As
a result, no message will get scored as ham, and you'll get high false
positive and high Unsure rates. You won't get any false negatives, though.
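To make the spam-only scenario concrete, here is a minimal sketch of a per-token spam probability. It assumes a Robinson-style adjusted ratio; the function name, the s=0.45 and x=0.5 defaults, and the "treat an empty pile as size 1" rule are assumptions chosen for illustration, not a copy of the project's code.

```python
def token_spam_prob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Estimate how spammy a token looks (0.0 = hammy, 1.0 = spammy).

    spam_count / ham_count: number of trained spam / ham messages that
    contain the token.  nspam / nham: total messages trained in each pile.
    s and x are assumed values for the strength and prior of an unseen token.
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # never-seen token: fall back to the neutral prior

    # Assumption: treat an empty pile as size 1 to avoid dividing by zero.
    nspam = float(nspam or 1)
    nham = float(nham or 1)

    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    prob = spam_ratio / (spam_ratio + ham_ratio)

    # Pull the raw ratio toward x when there is little evidence (small n).
    return (s * x + n * prob) / (s + n)


# With 1000 spam and no ham trained, ham_count is 0 for every known token,
# so even a token seen in a single spam scores well above the neutral 0.5.
rare = token_spam_prob(spam_count=1, ham_count=0, nspam=1000, nham=0)
common = token_spam_prob(spam_count=50, ham_count=0, nspam=1000, nham=0)
print(round(rare, 3), round(common, 3))
```

Under these assumptions no known token can score below 0.5, which is why nothing ends up classified as ham.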
Now add training on one ham. The situation will improve, but probably not
> is it in balance when i have the same number of messages in each pile?
> or when the total size of each pile is the same?
"Same number of messages" has been a good-enough approximation in practice.
It's possible to get into the same kinds of trouble if, e.g., you trained on
one spam containing a million words, and one ham containing a single word,
but that's not something to worry about in real life.
> how far out of balance is considered "reasonable"? how far out of
> balance can it get before i notice problems?
A sharp answer depends on your exact email mix, and exact training strategy.
Both differ across users. I start to see (minor) flakiness under my combo
if the ratio of messages trained on starts to exceed 2-to-1, although I've
been "happy enough" letting it slide up to 5-to-1; I'm not happy enough if
it gets worse than 5-to-1. Others here have reported no problems with
ratios up to 10-to-1. Some people using the Outlook addin have ratios
exceeding 300-to-1, but that's always something we *deduce* after they
complain about poor classification performance <wink>.
My seeded-with-200-of-each then trained-on-mistakes-and-some-unsures
database today has ... 437 ham and 456 spam, a little more than one day's
total email volume.
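The rules of thumb above can be turned into a quick check on the trained-message counts. This is only a sketch: the function names are made up, and the 2-to-1 and 5-to-1 cutoffs are the author's personal comfort levels quoted above, not universal limits.

```python
def training_ratio(nham, nspam):
    """Return larger pile / smaller pile, counting messages rather than size."""
    larger, smaller = max(nham, nspam), min(nham, nspam)
    if smaller == 0:
        return float("inf")  # one pile is empty: as unbalanced as it gets
    return larger / smaller


def describe_balance(nham, nspam, minor=2.0, limit=5.0):
    """Label a database using the (assumed) 2-to-1 and 5-to-1 cutoffs."""
    ratio = training_ratio(nham, nspam)
    if ratio <= minor:
        return "balanced"
    if ratio <= limit:
        return "tolerable"
    return "out of balance"


# The counts reported in the post: 437 ham and 456 spam.
print(round(training_ratio(437, 456), 2), describe_balance(437, 456))
```

Other users may want looser cutoffs, since ratios up to 10-to-1 have been reported to work for some mail mixes.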