[Spambayes] More "spam of the future" lately?
skip at pobox.com
Tue Dec 16 16:56:45 EST 2003
>> The problem is, all of these seem to be slipping by my trained
>> SpamBayes, scoring 10% or less.
Tim> Why? Look at the spam clues. There has to be something decidedly
Tim> hammy about them to score that low, and a collection of random
Tim> words isn't decidedly hammy except by accident. There must be more
Tim> to it. If they're managing to hit something *systematically* hammy
Tim> for you, then continued training will make whatever that is stop
Tim> looking hammy to you.
In my own experience, "pilot error" is one of the first causes I consider
for such problems. It occurs to me that a simple script (or a database
parallel to the training database) which maps tokens to lists of spam/ham
message ids, instead of just message counts, might be helpful in tracking
down such mistakes. Instead of executing

    db = shelve.open("hammie.db")

I might execute

    db = shelve.open("hammie-msgids.db")

and find that a token maps to something like

    [["spam-msgid1", "spam-msgid2"], ["ham-msgid1", ..., "ham-msgid12"]]

thus allowing me to more easily locate the spuriously trained ham messages
which are the source of the "url:biz" token.
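
A minimal sketch of how such a parallel index might be built follows. It
is an illustration only, not SpamBayes code: build_msgid_index,
toy_tokenize and the sample messages are hypothetical names and data, and
a real version would use the same tokenizer as the classifier and the
record of which messages were trained as spam or ham.

```python
from collections import defaultdict


def build_msgid_index(messages, tokenize):
    """Map each token to [spam_msgids, ham_msgids].

    messages: iterable of (msgid, is_spam, text) tuples (assumed layout).
    tokenize: callable returning the tokens found in a message text.
    """
    index = defaultdict(lambda: [[], []])
    for msgid, is_spam, text in messages:
        slot = 0 if is_spam else 1
        # Use a set so each message id is listed once per token.
        for token in set(tokenize(text)):
            index[token][slot].append(msgid)
    return dict(index)


def toy_tokenize(text):
    # Stand-in tokenizer for the example; not the SpamBayes tokenizer.
    return text.lower().split()


# Hypothetical training record: (message id, trained as spam?, text).
trained = [
    ("spam-msgid1", True, "cheap pills url:biz"),
    ("ham-msgid1", False, "meeting notes url:biz"),
    ("ham-msgid2", False, "lunch tomorrow"),
]

index = build_msgid_index(trained, toy_tokenize)
spam_ids, ham_ids = index["url:biz"]
print(spam_ids, ham_ids)  # ['spam-msgid1'] ['ham-msgid1']
```

The ham list for a suspicious token then names the messages worth
rechecking for mistraining. Since the values are plain lists of strings,
the same mapping could presumably be stored in a shelve file alongside
the training database.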