[spambayes-dev] A spectacular false positive
Rob Hooft
rob at hooft.net
Sat Nov 15 04:27:57 EST 2003
Tim Peters wrote:
> I view that mostly as a danger of mistake-based training: as I've mentioned
> before, mistake-based training tends toward being hapax-driven, and hapaxes
> are brittle. There's nothing *inherently* spammy about, say, 16384, and
> because that's a power of 2 and I'm a computer geek, that *would* have
> appeared in several training ham if I hadn't fallen into mistake-based
> training (yes, 16384 had indeed appeared in one training spam).
I am now training on all mistakes and unsures, plus all ham scoring more
than 0.02 and all spam scoring less than 0.99. Total trained messages is
~250 both ways, and 97+% of spam scores 0.99+, leaving only 1-2 new spams
per day, less than 1 unsure per day, and ~1 new ham per day to train on.
I am really pleased by the performance of this training schedule. It is
not as brittle as mistake-based training, but it still ignores the
obviously repetitive things like CVS log messages, of which I receive a few
dozen per day. It keeps the database reasonably small, but it is not really
hapax-driven.
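
The selection rule described above can be sketched roughly as follows. This
is only an illustration, not SpamBayes code: the function name
training_reason is hypothetical, and the unsure band (0.20 to 0.90) is an
assumption taken from the usual default cutoffs, since the post does not
state which cutoffs are in use. Only the 0.02 and 0.99 margins come from the
post itself.

```python
# Hypothetical sketch of the training schedule; not part of SpamBayes.
HAM_CUTOFF = 0.20    # assumption: scores below this are classified as ham
SPAM_CUTOFF = 0.90   # assumption: scores at or above this are classified as spam
HAM_MARGIN = 0.02    # from the post: train on ham scoring more than this
SPAM_MARGIN = 0.99   # from the post: train on spam scoring less than this


def training_reason(score, is_spam):
    """Return why a message should be trained on, or None to skip it.

    score   -- classifier score in [0.0, 1.0]
    is_spam -- the true label, as decided by the user
    """
    if is_spam:
        if score < HAM_CUTOFF:
            return "mistake"   # spam classified as ham
        if score < SPAM_CUTOFF:
            return "unsure"
        if score < SPAM_MARGIN:
            return "margin"    # classified correctly, but not confidently
        return None            # confidently correct spam: skip
    if score >= SPAM_CUTOFF:
        return "mistake"       # ham classified as spam
    if score >= HAM_CUTOFF:
        return "unsure"
    if score > HAM_MARGIN:
        return "margin"
    return None                # confidently correct ham: skip


if __name__ == "__main__":
    for score, is_spam in [(0.995, True), (0.95, True), (0.01, False), (0.05, False)]:
        print(score, is_spam, training_reason(score, is_spam))
```

Under this rule, repetitive mail that already scores very close to 0.0 or
1.0 (the CVS log messages, for instance) never reaches the training set,
which is what keeps the database small.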
Rob
--
Rob W.W. Hooft || rob at hooft.net || http://www.hooft.net/people/rob/