[spambayes-dev] Evaluating a training corpus
Greg Ward
gward at python.net
Thu Jun 12 21:56:18 EDT 2003
[me, last Sunday]
> I'm mulling ways to evaluate the quality of a training corpus, and was
> wondering what the rest of you have tried.
[Tim's response]
> What is the purpose of testing for you? A useful answer will contain at
> least one number <wink>.
Primary purpose is to find misclassified messages. Secondary purpose is
to give me a warm fuzzy feeling that Spambayes is #1 (ie. correctly
classifies mail that it has not been trained on). (There, I got a
number in.)
> timcv does that 10 times (or N times, for whatever N you choose), training
> on (N-1)/N of the messages and scoring the remaining 1/N of them.
[...]
> That's what timcv does if you set N equal to the number of messages (M) in
> the database. In outline:
Hmmm, OK. I guess I could use timcv.py then, but rearranging my 18
corpora into 10 directories each would be a bit inconvenient. So I
tried an end-run around timcv.py by modifying my scoring script to
untrain, score, and retrain. Here's a simplified version:
"""
import sys
from spambayes import hammie
from spambayes import tokenizer
def score(tokens, label):
    (prob, clues) = bayes.spamprob(tokens, True)
    high_clues = ["%s:%.3f" % clue for clue in clues[-5:]]
    high_clues = ", ".join(high_clues)
    print "%s: %.3f: %s" % (label, prob, high_clues)
(db_filename, msg_filename) = sys.argv[1:]
hammie = hammie.open(db_filename, mode="w")
bayes = hammie.bayes
# Read and tokenize message (which must be spam)
message = open(msg_filename).read()
tokens = list(tokenizer.tokenize(message))
# Score with that message (presumably) in the database.
score(tokens, "initial")
# Untrain (ie. remove this message from the database) and score again
# (this is where we assume the message is spam).
bayes.unlearn(tokens, True)
score(tokens, "unlearn")
# Retrain and score one last time. Should give identical results
# to the initial scoring... but doesn't!
bayes.learn(tokens, True)
score(tokens, "relearn")
"""
...does that look correct? It seems to work with a pickle store, but
I'm getting weird results with a DB store. I think that's another issue
though -- see my next post...
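To show the untrain/score/retrain idea in isolation, here is a minimal sketch in Python 3 (the script above is Python 2). ToyClassifier and leave_one_out are hypothetical names invented for illustration; this is not Spambayes code and its scoring formula is a crude stand-in, not the Spambayes algorithm. Only the learn/unlearn pattern mirrors the script above. The point is that with a plain in-memory store, unlearning and then relearning a message should restore the original score exactly.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a trainable classifier (NOT the Spambayes API)."""

    def __init__(self):
        # Per-class token counts and per-class message counts.
        self.counts = {True: Counter(), False: Counter()}
        self.nmsgs = {True: 0, False: 0}

    def learn(self, tokens, is_spam):
        self.counts[is_spam].update(set(tokens))
        self.nmsgs[is_spam] += 1

    def unlearn(self, tokens, is_spam):
        self.counts[is_spam].subtract(set(tokens))
        self.nmsgs[is_spam] -= 1

    def spamprob(self, tokens):
        # Crude score in [0, 1]: the mean of smoothed per-token spam ratios.
        probs = []
        for tok in sorted(set(tokens)):
            s = (self.counts[True][tok] + 0.5) / (self.nmsgs[True] + 1.0)
            h = (self.counts[False][tok] + 0.5) / (self.nmsgs[False] + 1.0)
            probs.append(s / (s + h))
        return sum(probs) / len(probs) if probs else 0.5


def leave_one_out(classifier, corpus):
    """Score each (tokens, is_spam) pair while that message is untrained."""
    results = []
    for tokens, is_spam in corpus:
        classifier.unlearn(tokens, is_spam)      # take the message out
        results.append((classifier.spamprob(tokens), is_spam))
        classifier.learn(tokens, is_spam)        # put it back
    return results
```

Running leave_one_out over an already-trained corpus gives one held-out score per message without rearranging anything on disk. If the score of a message differs before and after the loop, the store is not round-tripping learn/unlearn correctly.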
Greg
--
Greg Ward <gward at python.net> http://www.gerg.ca/
Jesus Saves -- and you can too, by redeeming these valuable coupons!