[Spambayes] "Difficult cases" archive?

Tim Peters tim.one@comcast.net
Thu, 12 Sep 2002 11:44:21 -0400


[Robert Oschler]
> Is there an archive of spam cases that have "stumped the filters"

If you check out the software from CVS and use it on some email, you'll
quickly get examples of false positives and false negatives.

> or "if filtered create too many false positives"?

This is a statistical approach, so no single msg can (well, should <wink>)
have a dramatic effect.

> I'd like to try my hand at nuking those.  Again, newbie here.  If it's
> a dumb question then mea culpa.

Not a dumb question at all.  What I expect you'll find is that the false
negatives under this scheme are breathtakingly false -- screamingly obvious
spam, often long and chatty.  Graham's scoring scheme ignores almost all the
words in a msg, picking on just the about-a-dozen that have "probabilities"
farthest from 0.5.  This seems amazingly effective given enough training
data, but when it fails it's often a spectacular failure.  When you've got
100 clues with prob 0.01 and another 100 with prob 0.99, the scheme is just
lost.
I made an improvement over Graham's original scoring scheme that reduced the
frequency of these gross errors, but they still happen.  A scheme to do
better would be most welcome!
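
To make the failure mode concrete, here's a minimal sketch of Graham-style
scoring as described above.  It is not the code in CVS, and it doesn't
include the improvement I mentioned.  The name graham_score and the
max_clues cutoff of 15 are assumptions for illustration; the input is just
a list of per-word spam "probabilities".

```python
from math import prod  # Python 3.8+


def graham_score(probs, max_clues=15):
    """Combine per-word spam probabilities, Graham style (illustrative sketch).

    probs     -- per-word spam probabilities, each strictly between 0 and 1
    max_clues -- how many of the most extreme clues to keep (assumed value)
    """
    # Keep only the clues farthest from the neutral 0.5; ignore the rest.
    extreme = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:max_clues]
    # Naive-Bayes-style combination of the surviving clues.
    spam = prod(extreme)
    ham = prod(1.0 - p for p in extreme)
    return spam / (spam + ham)


# Evenly split evidence: 100 strong ham clues and 100 strong spam clues.
lost = graham_score([0.01] * 100 + [0.99] * 100)
```

All 200 clues in that last call are (about) equally far from 0.5, so which
ones survive the cutoff comes down to tie-breaking and rounding.  Whichever
side wins fills every slot, and the score lands very near 0 or very near 1
even though the evidence is split right down the middle.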