[Spambayes] spambayes fronting a mailing list?

Thu Jan 16 22:06:16 EST 2003

[Mark Hammond]
> ...
> If we have a decent framework in place, then "obvious" spam would be
> anything that is spam given complete data.

That's not how I meant it.  "Obvious" is a human judgment, and is (AFAICT)
subjective.  Purely mistake-based training, starting from an empty database,
left substantial "obvious spam" in the Unsure category even after 2 weeks,
which is well over 1200 spam at the rate I get spam.  So little spam got
trained on during that time (there weren't many mistakes after the first two
days) that spam-detection remained mostly hapax-driven, and the few
instances of trained farm-porn spam didn't do enough to nail gay-porn spam
too, etc.
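
For concreteness, here is a rough sketch of what "purely mistake-based training" could look like as a loop. It is not Spambayes code: the score and train callables, the 0.20/0.90 cutoffs, and the train_on_unsure flag are all assumptions made for illustration, and whether an Unsure counts as a "mistake" is left as a switch because it isn't spelled out above.

```python
def mistake_based_training(messages, score, train,
                           ham_cutoff=0.20, spam_cutoff=0.90,
                           train_on_unsure=True):
    """Train only on messages the classifier currently gets wrong.

    messages: iterable of (tokens, is_spam) pairs, in arrival order.
    score:    hypothetical callable, tokens -> spam score in [0.0, 1.0].
    train:    hypothetical callable, (tokens, is_spam) -> None.
    Returns the number of messages that were trained on.
    """
    trained = 0
    for tokens, is_spam in messages:
        s = score(tokens)
        if s >= spam_cutoff:
            verdict = "spam"
        elif s < ham_cutoff:
            verdict = "ham"
        else:
            verdict = "unsure"
        # An outright mistake is a spam called ham, or a ham called spam.
        wrong = verdict == ("ham" if is_spam else "spam")
        # Assumption: optionally treat Unsure as a mistake too.
        if wrong or (train_on_unsure and verdict == "unsure"):
            train(tokens, is_spam)
            trained += 1
    return trained
```

The return value makes the problem described above visible: once the first few mistakes have been trained on, the count stops growing, so very little new spam ever reaches the database.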

"Obvious spam" means that you personally are surprised to see it rate
Unsure, at least surprised enough to click the "Spam Clues" button to try to
figure out why it wasn't nailed.

> ie, assume we have 3000 ham and 3000 spam.  My training strategy
> would be to perform a complete train over the entire database, and
> collect "correct" scores for each item.

I'm not sure what correct means here.  How do you decide?  You're surely not
going to look at those 6,000 msgs by hand and assign a two-digit number to
each, right?

> We then can test out various training strategies, watching not only
> the fp/fn/unsure rates, but also deviance from the "correct" score.
> ...
> Do you believe we can reasonably formalize some tests for these
> strategies?

If you can define what it is you're trying to measure, sure <wink>.  All
along in testing we used a three-term cost function (assigning different
"dollar" penalties to FP, FN and unsure), and the measure of goodness was
how small the total penalty got.  It's easy (albeit tedious) to set up
experiments to measure the effect of any definable training strategy on
that.  If you define a different penalty function, likewise.
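
As a minimal sketch of such a three-term cost function (not the actual test-harness code): the function name, the 0.20/0.90 cutoffs, and the "dollar" weights below are illustrative assumptions, so plug in whatever penalties you care about.

```python
def total_cost(scores, is_spam,
               ham_cutoff=0.20, spam_cutoff=0.90,
               fp_cost=10.0, fn_cost=1.0, unsure_cost=0.20):
    """Total "dollar" penalty for one test run; smaller is better.

    scores:  spam scores in [0.0, 1.0], one per message.
    is_spam: matching sequence of booleans (True means the msg is spam).
    The cutoffs and the three weights are illustrative assumptions.
    """
    fp = fn = unsure = 0
    for score, spam in zip(scores, is_spam):
        if score >= spam_cutoff:
            if not spam:
                fp += 1          # ham scored as spam
        elif score < ham_cutoff:
            if spam:
                fn += 1          # spam scored as ham
        else:
            unsure += 1          # in the middle ground, ham or spam alike
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost
```

Comparing two training strategies then amounts to running each over the same test data and seeing which one yields the smaller total.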