[Spambayes] More "spam of the future" lately?

Tue Dec 16 16:56:45 EST 2003

    >> The problem is, all of these seem to be slipping by my trained
    >> SpamBayes, scoring 10% or less.

    Tim> Why?  Look at the spam clues.  There has to be something decidedly
    Tim> hammy about them to score that low, and a collection of random
    Tim> words isn't decidedly hammy except by accident.  There must be more
    Tim> to it.  If they're managing to hit something *systematically* hammy
    Tim> for you, then continued training will make whatever that is stop
    Tim> looking hammy to you.

In my own experience, "pilot error" (a message trained into the wrong
category) is one of the first causes worth ruling out.  It occurs to me that
a simple script (or a database parallel to the training database) which maps
tokens to lists of spam/ham message ids, instead of just message counts,
might be helpful in tracking down such mistakes.  Instead of executing

    import shelve

    db = shelve.open("hammie.db")
    print(db["url:biz"])

and getting

    (2, 12)

I might execute

    db = shelve.open("hammie-msgids.db")
    print(db["url:biz"])

and get

    [["spam-msgid1", "spam-msgid2"], ["ham-msgid1", ..., "ham-msgid12"]]

thus allowing me to more easily locate the mistakenly trained ham messages
which contributed the "url:biz" token.
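
A rough sketch of how such a parallel database might be filled in at training
time follows.  Nothing here is part of SpamBayes: record_training,
messages_for and the "hammie-msgids" file name are made up for illustration,
and the tokens are assumed to come from whatever tokenizer the real trainer
already uses.

```python
import os
import shelve
import tempfile

SPAM, HAM = 0, 1  # positions in the [spam_ids, ham_ids] pair


def record_training(db, tokens, msgid, is_spam):
    """Hypothetical helper: note that msgid was trained with these tokens."""
    slot = SPAM if is_spam else HAM
    for token in set(tokens):  # count each token once per message
        entry = db.get(token, [[], []])
        if msgid not in entry[slot]:
            entry[slot].append(msgid)
        # Reassign so a shelve (opened without writeback) stores the change.
        db[token] = entry


def messages_for(db, token):
    """Hypothetical helper: return [spam_ids, ham_ids] for a token."""
    return db.get(token, [[], []])


if __name__ == "__main__":
    # Demo in a throwaway directory; a real script would keep the file
    # next to the training database.
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "hammie-msgids")
        with shelve.open(path) as db:
            record_training(db, ["url:biz", "free"], "spam-msgid1", True)
            record_training(db, ["url:biz"], "ham-msgid1", False)
            print(messages_for(db, "url:biz"))
```

Storing full id lists will make this file much larger than the count
database, so it is probably best treated as a debugging aid rather than
something the classifier itself reads.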

Skip