[spambayes-dev] Evaluating a training corpus

Thu Jun 12 21:56:18 EDT 2003

[me, last Sunday]
> I'm mulling ways to evaluate the quality of a training corpus, and was
> wondering what the rest of you have tried.

[Tim's response]
> What is the purpose of testing for you?  A useful answer will contain at
> least one number <wink>.

Primary purpose is to find misclassified messages.  Secondary purpose is
to give me a warm fuzzy feeling that Spambayes is #1 (ie. correctly
classifies mail that it has not been trained on).  (There, I got a
number in.)

> timcv does that 10 times (or N times, for whatever N you choose), training
> on (N-1)/N of the messages and scoring the remaining 1/N of them.
[...]
> That's what timcv does if you set N equal to the number of messages (M) in
> the database.  In outline:

Hmmm, OK.  I guess I could use timcv.py then, but rearranging my 18
corpora into 10 directories each would be a bit inconvenient.  So I
tried an end-run around timcv.py by modifying my scoring script to
untrain, score, and retrain.  Here's a simplified version:

"""
import sys 
from spambayes import hammie 
from spambayes import tokenizer 

def score(tokens, label): 
    (prob, clues) = bayes.spamprob(tokens, True) 
    high_clues = ["%s:%.3f" % clue for clue in clues[-5:]] 
    high_clues = ", ".join(high_clues) 
    print "%s: %.3f: %s" % (label, prob, high_clues) 

(db_filename, msg_filename) = sys.argv[1:] 
# Use a distinct name so the open handle does not shadow the hammie module.
h = hammie.open(db_filename, mode="w")
bayes = h.bayes

# Read and tokenize message (which must be spam) 
message = open(msg_filename).read() 
tokens = list(tokenizer.tokenize(message)) 

# Score with that message (presumably) in the database. 
score(tokens, "initial") 

# Untrain (ie. remove this message from the database) and score again 
# (this is where we assume the message is spam). 
bayes.unlearn(tokens, True) 
score(tokens, "unlearn") 

# Retrain and score one last time.  Should give identical results 
# to the initial scoring... but doesn't! 
bayes.learn(tokens, True) 
score(tokens, "relearn") 
"""

...does that look correct?  It seems to work with a pickle store, but
I'm getting weird results with a DB store.  I think that's another issue
though -- see my next post...
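
For comparison, here is a minimal, self-contained sketch of the same
untrain/score/retrain loop run over a whole corpus, i.e. the N = M case
of timcv.  ToyClassifier, its methods, and the sample corpus are
hypothetical stand-ins invented for illustration, not the Spambayes API,
and the scoring formula is deliberately crude.  It is written for
Python 3.  The point is only the invariant: after unlearn followed by
learn, the database should be exactly what it was before.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a trained database (not the Spambayes API)."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def learn(self, tokens, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.update(set(tokens))  # count each token once per message
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, tokens, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.subtract(set(tokens))
        for tok in set(tokens):
            if counts[tok] <= 0:
                del counts[tok]  # drop tokens no remaining message contains
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1

    def score(self, tokens):
        """Crude spam score in [0, 1]: mean per-token spam ratio."""
        ratios = []
        for tok in set(tokens):
            s = self.spam_counts[tok] / max(self.nspam, 1)
            h = self.ham_counts[tok] / max(self.nham, 1)
            if s + h > 0:  # skip tokens never seen in training
                ratios.append(s / (s + h))
        return sum(ratios) / len(ratios) if ratios else 0.5

    def state(self):
        return (self.nspam, self.nham,
                dict(self.spam_counts), dict(self.ham_counts))


def leave_one_out(classifier, corpus):
    """Score each (tokens, is_spam) pair with that message untrained."""
    results = []
    for tokens, is_spam in corpus:
        before = classifier.state()
        classifier.unlearn(tokens, is_spam)
        results.append((classifier.score(tokens), is_spam))
        classifier.learn(tokens, is_spam)
        # Retraining must restore the database exactly.
        assert classifier.state() == before, "retrain did not restore state"
    return results


# Made-up sample corpus, trained in full before the loop runs.
corpus = [
    ("cheap pills buy now".split(), True),
    ("buy cheap watches now".split(), True),
    ("meeting notes for monday".split(), False),
    ("notes from the monday meeting".split(), False),
]
clf = ToyClassifier()
for tokens, is_spam in corpus:
    clf.learn(tokens, is_spam)

results = leave_one_out(clf, corpus)
for score, is_spam in results:
    print("%s: %.3f" % ("spam" if is_spam else "ham", score))
```

If the assertion inside leave_one_out ever fires, the store is not
round-tripping unlearn/learn, which is the symptom described above for
the DB store.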

        Greg
-- 
Greg Ward <gward at python.net>                         http://www.gerg.ca/
Jesus Saves -- and you can too, by redeeming these valuable coupons!