[spambayes-dev] Pickle vs DB inconsistencies

Thu Jun 12 23:01:46 EDT 2003

No wait, false alarm on the false alarm.  I *am* still seeing
inconsistent behaviour between pickle and DB stores, only this time I
have to score/untrain/score/retrain/score to show the difference.

Here's the code I'm using (same as I posted earlier this evening):

"""
import sys 
from spambayes import hammie 
from spambayes import tokenizer 

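def score(tokens, label):
    # Uses the module-level "bayes" set up below; prints the score and
    # the last five clues returned by spamprob().
    (prob, clues) = bayes.spamprob(tokens, True)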
    high_clues = ["%s:%.3f" % clue for clue in clues[-5:]] 
    high_clues = ", ".join(high_clues) 
    print "%s: %.3f: %s" % (label, prob, high_clues) 

(db_filename, msg_filename) = sys.argv[1:]
usedb = db_filename.endswith(".db")  # assume pickle otherwise
# Use a separate name so the hammie module isn't shadowed.
h = hammie.open(db_filename, usedb=usedb, mode="w")
bayes = h.bayes

# Read and tokenize message (which must be spam) 
message = open(msg_filename).read() 
tokens = list(tokenizer.tokenize(message)) 

# Score with that message (presumably) in the database. 
score(tokens, "initial") 

# Untrain (i.e. remove this message from the database) and score again
# (this is where we assume the message is spam). 
bayes.unlearn(tokens, True) 
score(tokens, "unlearn") 

# Retrain and score one last time.  Should give identical results 
# to the initial scoring... but doesn't! 
bayes.learn(tokens, True) 
score(tokens, "relearn") 
"""

First, let's score/untrain/score/retrain/score the same message against
two copies of the same training database, one pickle and one DB (sorry
about the long lines):

$ ./simplescore db/default.db $msg 
initial: 0.995: volume:0.987, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
unlearn: 0.482: volume:0.986, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
relearn: 1.000: volume:0.987, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994

$ ./simplescore db/default.pkl $msg
initial: 0.995: volume:0.987, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
unlearn: 0.272: volume:0.986, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
relearn: 0.995: volume:0.987, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994

I see two problems here:

  * untraining on this message gives different scores for the pickle
    and DB stores (0.272 vs. 0.482)

  * after retraining, the score with the DB store (1.000) is not the
    same as its initial score (0.995), nor as the score with the
    retrained pickle store (0.995)
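
To spell out what I'm expecting: untraining and then retraining on the
same tokens should leave the store exactly where it started.  Here's a
toy sketch of that invariant.  ToyStore is a made-up dict-backed counter
for illustration only; it is not the spambayes classifier and knows
nothing about pickle or DB stores.

```python
# Toy illustration only: a hypothetical dict-backed spam token counter.
class ToyStore(object):
    def __init__(self):
        self.nspam = 0      # number of spam messages trained on
        self.counts = {}    # token -> number of spam messages containing it

    def learn(self, tokens):
        self.nspam += 1
        for tok in set(tokens):
            self.counts[tok] = self.counts.get(tok, 0) + 1

    def unlearn(self, tokens):
        self.nspam -= 1
        for tok in set(tokens):
            remaining = self.counts[tok] - 1
            if remaining:
                self.counts[tok] = remaining
            else:
                del self.counts[tok]    # no trained message contains it now

    def snapshot(self):
        # Copy the dict so later training doesn't alter the snapshot.
        return (self.nspam, dict(self.counts))


store = ToyStore()
store.learn(["volume", "hands"])
store.learn(["volume", "purchase"])

before = store.snapshot()
store.unlearn(["volume", "purchase"])
during = store.snapshot()
store.learn(["volume", "purchase"])
after = store.snapshot()

assert during != before    # untraining changed the state
assert after == before     # retraining restored it exactly
```

If both stores kept to this, "relearn" would always match "initial",
whichever backend is in use.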

Now let's repeat the experiment:

$ ./simplescore db/default.db $msg 
initial: 1.000: volume:0.987, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
unlearn: 0.482: volume:0.986, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
relearn: 1.000: volume:0.987, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994

$ ./simplescore db/default.pkl $msg
initial: 0.995: volume:0.987, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
unlearn: 0.272: volume:0.986, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994
relearn: 0.995: volume:0.987, hands:0.988, materials:0.991, 24/7:0.994, purchase:0.994

Still getting inconsistent results after untraining.  At least the DB
store has settled down and gives the same results initially and after
retraining.  Too bad it's inconsistent with the pickle store!  ;-(
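
One way to narrow this down might be to dump the per-token counts from
both stores and diff them.  A minimal sketch, assuming the counts have
already been pulled out into plain dicts of token -> (spamcount,
hamcount); diff_counts is a hypothetical helper, the extraction step is
not shown, and the data below is made up rather than taken from my
databases.

```python
def diff_counts(pickle_counts, db_counts):
    """Return {token: (pickle_value, db_value)} for tokens that differ.

    A token missing from one side shows up with None on that side.
    """
    diffs = {}
    for tok in set(pickle_counts) | set(db_counts):
        a = pickle_counts.get(tok)
        b = db_counts.get(tok)
        if a != b:
            diffs[tok] = (a, b)
    return diffs


# Made-up example data: token -> (spamcount, hamcount).
pkl = {"volume": (12, 0), "hands": (7, 1)}
db = {"volume": (12, 0), "hands": (8, 1), "24/7": (3, 0)}

result = diff_counts(pkl, db)
```

Any token that comes back from this is one where the two stores
disagree, which would at least show where to look.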

*Now* WTF is going on?

        Greg
-- 
Greg Ward <gward at python.net>                         http://www.gerg.ca/
This message transmitted with 100% recycled electrons.