[spambayes-dev] A spectacular false positive

Kenny Pitt kennypitt at hotmail.com
Mon Nov 17 10:42:23 EST 2003


Tim Peters wrote:
> Sigh -- we need solid research on training disciplines that work
> great in real-life use, respecting that anything requiring human
> input will barely get used except by geeks who never tire of watching
> the training process. We're getting a lot of anecdotal evidence
> (which ain't the same thing) about different schemes, and I'm afraid
> no two of the developers train in the same way anymore.  It's a good
> thing the algorithm appears to have turned out to be robust against
> almost any training insanity short of what Outlook users can stumble
> into <0.9 wink>. 

Yes, the Outlook plugin pretty much guarantees mistake-based training
for anyone not familiar enough with the program (or too lazy <wink>) to
update the training through SpamBayes Manager periodically.  The
majority of my ham comes either from the same list of senders at work
or from the SpamBayes lists, so it didn't take SpamBayes long to start
classifying all of those correctly.  I got up to almost a 10:1
spam-to-ham ratio pretty quickly.

To try to work around the problem, I implemented two experimental
options to train on all certain ham and train on all certain spam.
Since I can turn them on or off independently, I can use them to get my
ratio back in balance and then turn them off.  What I'd like to
implement is a way to do this automatically.  I'd like to say something
like, "If my spam count reaches twice my ham count, train on all
certain hams until the counts are within 5% of each other again."  These
cutoffs would of course be configurable.
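
For concreteness, here's the kind of logic I have in mind, as a rough
Python sketch.  The names (ham_count, spam_count, certain_ham_messages,
train_ham) are just placeholders for however the plugin ends up exposing
these, not an existing SpamBayes API:

    REBALANCE_RATIO = 2.0    # start rebalancing when spam >= 2 * ham
    TARGET_TOLERANCE = 0.05  # stop once counts are within 5% of each other

    def rebalance(ham_count, spam_count, certain_ham_messages, train_ham):
        """Train on 'certain ham' until the counts are close again."""
        if spam_count < REBALANCE_RATIO * ham_count:
            return ham_count  # ratio still acceptable, nothing to do
        for msg in certain_ham_messages:
            if abs(spam_count - ham_count) <= TARGET_TOLERANCE * spam_count:
                break  # counts are within 5% of each other again
            train_ham(msg)    # add the message to the ham training set
            ham_count += 1
        return ham_count

The same loop with the roles reversed would handle the opposite case,
where ham training gets too far ahead of spam.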

It will take me a little while to get around to implementing this and
even longer to see if it is effective, but I'll report results (or at
least perceptions) when I have them.

-- 
Kenny Pitt



