[spambayes-dev] Pickle vs DB inconsistencies
Greg Ward
gward at python.net
Thu Jun 12 23:01:46 EDT 2003
No wait, false alarm on the false alarm. I *am* still seeing
inconsistent behaviour between pickle and DB stores, only this time I
have to score/untrain/score/retrain/score to show the difference.
Here's the code I'm using (same as I posted earlier this evening):
"""
import sys
from spambayes import hammie
from spambayes import tokenizer
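def score(tokens, label):
    # 'bayes' is the module-level classifier set up below.
    (prob, clues) = bayes.spamprob(tokens, True)
    # Show only the five strongest clues from the end of the list.
    high_clues = ["%s:%.3f" % clue for clue in clues[-5:]]
    high_clues = ", ".join(high_clues)
    print "%s: %.3f: %s" % (label, prob, high_clues)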
(db_filename, msg_filename) = sys.argv[1:]
usedb = db_filename.endswith(".db") # assume pickle otherwise
hammie = hammie.open(db_filename, usedb=usedb, mode="w")
bayes = hammie.bayes
# Read and tokenize message (which must be spam)
message = open(msg_filename).read()
tokens = list(tokenizer.tokenize(message))
# Score with that message (presumably) in the database.
score(tokens, "initial")
# Untrain (ie. remove this message from the database) and score again
# (this is where we assume the message is spam).
bayes.unlearn(tokens, True)
score(tokens, "unlearn")
# Retrain and score one last time. Should give identical results
# to the initial scoring... but doesn't!
bayes.learn(tokens, True)
score(tokens, "relearn")
"""
First, let's score/untrain/score/retrain/score the same message with two
copies of the same training database (one pickle, one DB) (sorry about
the long lines):
$ ./simplescore db/default.db $msg
initial: 0.995: volume:0.987, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
unlearn: 0.482: volume:0.986, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
relearn: 1.000: volume:0.987, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
$ ./simplescore db/default.pkl $msg
initial: 0.995: volume:0.987, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
unlearn: 0.272: volume:0.986, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
relearn: 0.995: volume:0.987, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
I see two problems here:
* untraining on this message gives a different score for the pickle
and DB store
* after retraining, the score with the DB store is not the same as the
initial score (or with the retrained pickle store)
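To make the two problems above easier to spot without eyeballing long lines, the output of the two runs can be compared mechanically. This is only a sketch: parse_scores() and disagreements() are hypothetical helpers that are not part of spambayes, and they assume the "label: score: clues" line format printed by the script above.

```python
def parse_scores(output):
    """Map each label ("initial", "unlearn", "relearn") to its score."""
    scores = {}
    for line in output.splitlines():
        # Expected shape: "<label>: <score>: <clue>, <clue>, ..."
        parts = line.split(": ", 2)
        if len(parts) >= 2:
            try:
                scores[parts[0].strip()] = float(parts[1])
            except ValueError:
                pass  # not a score line (e.g. the shell prompt line)
    return scores


def disagreements(db_output, pkl_output, tolerance=0.0005):
    """Return (label, db_score, pkl_score) where the two stores differ."""
    db = parse_scores(db_output)
    pkl = parse_scores(pkl_output)
    result = []
    for label in ("initial", "unlearn", "relearn"):
        if label in db and label in pkl:
            if abs(db[label] - pkl[label]) > tolerance:
                result.append((label, db[label], pkl[label]))
    return result
```

Fed the two transcripts above, this should flag "unlearn" and "relearn" and leave "initial" alone. The tolerance is an arbitrary choice matched to the three decimal places the script prints.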
Now let's repeat the experiment:
$ ./simplescore db/default.db $msg
initial: 1.000: volume:0.987, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
unlearn: 0.482: volume:0.986, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
relearn: 1.000: volume:0.987, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
$ ./simplescore db/default.pkl $msg
initial: 0.995: volume:0.987, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
unlearn: 0.272: volume:0.986, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
relearn: 0.995: volume:0.987, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
Still getting inconsistent results after untraining. At least the DB
store has settled down and gives the same results initially and after
retraining. Too bad it's inconsistent with the pickle store! ;-(
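For reference, here is the invariant I expect any store to satisfy, written against a toy model. This is a sketch only: ToyStore is a hypothetical in-memory stand-in that tracks nothing but spam counts, and it says nothing about how the real pickle or DB classifiers are implemented.

```python
from collections import Counter


class ToyStore(object):
    """Hypothetical in-memory token-count store (NOT spambayes code)."""

    def __init__(self):
        self.nspam = 0                # number of spam messages trained
        self.spamcount = Counter()    # per-token spam message counts

    def learn(self, tokens):
        self.nspam += 1
        for tok in set(tokens):       # count each token once per message
            self.spamcount[tok] += 1

    def unlearn(self, tokens):
        self.nspam -= 1
        for tok in set(tokens):
            self.spamcount[tok] -= 1
            if self.spamcount[tok] == 0:
                del self.spamcount[tok]   # drop tokens no longer seen

    def snapshot(self):
        return (self.nspam, dict(self.spamcount))


def roundtrip_restores_state(store, tokens):
    """True if unlearn followed by learn leaves the store unchanged."""
    before = store.snapshot()
    store.unlearn(tokens)
    store.learn(tokens)
    return store.snapshot() == before
```

With plain counts like these, the round trip restores the state exactly, so identical counts should give identical scores. That is why a "relearn" score that differs from "initial" looks like a bug in the store and not an expected rounding effect.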
*Now* WTF is going on?
Greg
--
Greg Ward <gward at python.net> http://www.gerg.ca/
This message transmitted with 100% recycled electrons.