kennypitt at hotmail.com
Fri Feb 27 13:47:08 EST 2004
Erin Lazzaro wrote:
> Training on mistakes and unsures seems the most intuitive, but since
> I have never yet had any ham classified as spam, I would expect the
> ratio to get very unbalanced. Why do people think otherwise? Do you
> start getting misclassified ham if the ratio gets too far out?
Training only on mistakes and unsures can certainly cause heavy
imbalance for some people. We've seen logfiles from users who have
100:1 imbalance or worse. It's possible that it will eventually cause
you to start getting misclassified ham, and at that point you can start
training on those messages. If the imbalance never causes any
misclassifications, then it isn't an issue.
The mathematics say that a perfect balance is best because otherwise
additional weight is given to the clues from one side or the other.
That's all theoretical, though, and what really matters is how it
behaves for you in practice. We've been kicking around some additional
theories about how we could automatically help you keep your training in
balance, but nobody has come up with a silver bullet yet.
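
To make the "additional weight" point concrete, here is a rough sketch of a
Robinson-style per-word score of the kind Spambayes is built on. The function
name, the example counts, and the constants s=0.45 and x=0.5 are assumptions
chosen for illustration, not code copied from the Spambayes source. It scores
the same word, seen 5 times in spam and 5 times in ham, under a balanced and an
unbalanced training set.

```python
def word_spam_score(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Illustrative per-word spam score (hypothetical helper).

    spam_count / ham_count: trained spam / ham messages containing the word.
    nspam / nham: total trained spam / ham messages (both must be > 0).
    s, x: assumed prior strength and prior probability for unseen words.
    """
    # Counts are normalized by the size of each training set, so one
    # message on the smaller side moves the ratio much more.
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    prob = spam_ratio / (spam_ratio + ham_ratio)

    # Pull the raw estimate toward the prior x when evidence is thin.
    n = spam_count + ham_count
    return (s * x + n * prob) / (s + n)


# Same word counts, different training balance.
balanced = word_spam_score(5, 5, nspam=1000, nham=1000)
imbalanced = word_spam_score(5, 5, nspam=1000, nham=10)

print(f"balanced   1000:1000 -> {balanced:.3f}")
print(f"imbalanced 1000:10   -> {imbalanced:.3f}")
```

With equal training counts the word comes out neutral (about 0.5). At 100:1 the
same word looks like a strong ham clue (roughly 0.03), because each of the few
ham messages carries about a hundred times the weight of a spam message. Whether
that ever hurts you in practice is exactly the question above.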