[Spambayes] RE: Routine training on correctly classified email?

Tim Peters tim.one at comcast.net
Sun Dec 7 20:15:22 EST 2003

[Robert K. Coe]
> The problem with mistake-based training is that almost all mistakes
> are false negatives. And most of the messages that go to the
> "Indefinite" folder turn out to be spam.

I'll accept that both are true for your email so far, but there's no basis
for assuming it's true of everyone.  As a counterexample, most of my unsures
over the last month have been ham.  Different people get different kinds of
email.  My ham unsures lately usually come from the private python-help
mailing list, where a wide variety of people I've never heard of before send
questions on everything from Python through pythons to Monty Python.  Some
are barely able to write English (some don't even try), and many use free
email accounts with auto-inserted ads at the bottom.  That *is* ham to me,
and I'm sure this same kind of unfocused mish-mash floods the inbox of
anyone at the receiving end of a public admin or help-desk address.

SpamBayes probably isn't optimal for my email mix, but I'm not going to
change it to favor mine at the expense of yours.  By the same token, I'm not
going to change it to favor yours at the expense of mine.  We've so far
stuck to very general algorithms that strive to favor nothing.  Qualitative
results on any specific email mix will and do vary, and generalizing from
one's own particular mix never leads to a truth.

> The result is that over time, the database becomes increasingly
> spam-heavy.

This hasn't been tested properly, but I agree the bulk of self-selected
reports have so far seemed to have an email mix more like yours than like

> This in turn degrades the reliability of the algorithm, according to the
> accepted wisdom.

That has been tested properly, and imbalance does hurt results with this

> Obviously this doesn't constitute "definitive proof" that automatic
> training would be better. But it does argue for giving it a try.
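
The imbalance argument in this thread can be illustrated with a small counting sketch. It is not SpamBayes code and calls no SpamBayes API: the function name, the policy names, and every number in the two example streams are invented for illustration. It only tallies which messages would be trained on under a "train on mistakes and unsures" policy versus a "train on everything" policy.

```python
def trained_counts(stream, policy):
    """Count messages trained as ham and as spam.

    stream: list of (true_label, verdict) pairs, where true_label is
            "ham" or "spam" and verdict is "ham", "spam" or "unsure".
    policy: "mistakes" trains only when the verdict differs from the
            true label (unsures included); "everything" trains on all.
    Both policy names are hypothetical, chosen for this sketch.
    """
    counts = {"ham": 0, "spam": 0}
    for true_label, verdict in stream:
        if policy == "mistakes" and verdict == true_label:
            continue  # correctly classified, so never trained on
        counts[true_label] += 1
    return counts


# Invented mix A: most mistakes and unsures are spam.
mix_a = (
    [("ham", "ham")] * 480 + [("ham", "unsure")] * 20
    + [("spam", "spam")] * 400 + [("spam", "unsure")] * 80
    + [("spam", "ham")] * 20
)

# Invented mix B: most unsures are ham.
mix_b = (
    [("ham", "ham")] * 400 + [("ham", "unsure")] * 100
    + [("spam", "spam")] * 490 + [("spam", "unsure")] * 10
)

for name, mix in (("A", mix_a), ("B", mix_b)):
    print(name, "mistakes:", trained_counts(mix, "mistakes"),
          "everything:", trained_counts(mix, "everything"))
```

With these made-up numbers, mistake-based training on mix A trains five spam for every ham, while the same policy on mix B trains ten ham for every spam. Which way the database drifts depends on the email mix, and the sketch says nothing about how much a given imbalance hurts accuracy.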


