[Spambayes] SpamBayes now filters less than 50% of my spam.

Sun Nov 16 00:24:41 EST 2003

[Ryan Malayter]
> ...
> The way the probabilities are actually computed, the more data you
> have, the more accurate your probabilities get, and the better the
> filter will perform.

If the measured statistical properties of ham and spam never changed over
time, that would be true.  A possible problem in practice is that the true
probabilities do change some over time, and then the more *stale* training
data you're carrying forward, the worse the math estimates current
probabilities.  That's one reason to favor a small database in practice:  it
responds more quickly to new training; it doesn't have so much of the past
pointing it in out-of-date directions.
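
To make the staleness point concrete, here is a minimal sketch. It assumes
a deliberately naive estimate (the share of a token's occurrences that were
in spam), not the formula SpamBayes actually uses, and the counts are made
up for illustration: a token that used to show up in spam 90% of the time
now shows up in spam only 10% of the time, and each database gets the same
100 fresh training messages.

```python
def spam_fraction(stale_spam, stale_ham, fresh_spam, fresh_ham):
    """Naive estimate: fraction of a token's occurrences that were in spam.

    Hypothetical helper for illustration only; SpamBayes computes its
    token probabilities differently.
    """
    spam = stale_spam + fresh_spam
    ham = stale_ham + fresh_ham
    return spam / (spam + ham)

# Fresh training data reflects the new reality: 10 spam hits, 90 ham hits.
FRESH_SPAM, FRESH_HAM = 10, 90

# Small database: only 100 stale occurrences (90 spam, 10 ham).
small = spam_fraction(90, 10, FRESH_SPAM, FRESH_HAM)

# Large database: 10,000 stale occurrences (9,000 spam, 1,000 ham).
large = spam_fraction(9000, 1000, FRESH_SPAM, FRESH_HAM)

print(f"small database estimate: {small:.3f}")
print(f"large database estimate: {large:.3f}")
```

Under those assumptions the small database has already moved halfway toward
the new rate, while the large one has barely budged from the old 90%.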

> Up to a point, of course... there will always be diminishing returns.
> There's not much difference in practical terms between 99.7% accuracy
> and 99.8% accuracy.

That depends on what you're doing, and expressing those as error rates makes
the issue clearer:  a 0.3% error rate is 50% larger than a 0.2% error rate.
This project was originally aimed at filtering high-volume tech mailing
lists, and when you're dealing with tens or hundreds of thousands of emails
per day, an absolute 0.1% increase in the error rate can translate to hundreds of
additional messages kicked out for moderator review every day.  If the
comp.lang.python tests at the time had been able to achieve *only* 99.9%
accuracy, I probably would have given up.
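
The arithmetic behind that can be sketched as follows. The daily volumes
are illustrative picks from the "tens or hundreds of thousands" range, not
measured figures, and the helper names are made up for this example.
Fractions are used only to keep the percentages exact.

```python
from fractions import Fraction

def relative_increase(old_rate, new_rate):
    """How much larger new_rate is than old_rate, as a fraction of old_rate."""
    return (new_rate - old_rate) / old_rate

def extra_errors_per_day(messages_per_day, rate_increase):
    """Additional misclassified messages per day for an absolute rate increase."""
    return messages_per_day * rate_increase

low = Fraction(2, 1000)   # 0.2% error rate, i.e. 99.8% accuracy
high = Fraction(3, 1000)  # 0.3% error rate, i.e. 99.7% accuracy

# Relative view: 0.3% is half again as large as 0.2%.
print("relative increase:", relative_increase(low, high))

# Absolute view: what the extra 0.1% costs at different (hypothetical) volumes.
for volume in (10_000, 100_000, 500_000):
    extra = extra_errors_per_day(volume, high - low)
    print(f"{volume} msgs/day: {extra} more errors/day")
```

At the low end of that range the extra 0.1% is a handful of messages a day;
it takes volumes in the hundreds of thousands before it reaches the
hundreds.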

That said, I agree that for personal email, for most people (those who don't
get thousands of emails per day) a difference of 0.1% in the error rate will
be noticed only by the self-destructively obsessed <wink>.