[spambayes-dev] Re: Pickle vs DB inconsistencies

Tue Jun 24 18:49:26 EDT 2003

On 12 June 2003, I said:
> I'm getting inconsistent results using the same training corpus when I
> store the database to a pickle vs a DB file.  Here's how I created the
> training databases (once DB, once pickle):

I've spent most of this afternoon digging into this, and I am utterly,
completely, totally, absolutely STUMPED.  I need help.  Here's my test
setup:

  * corpus with 10 spam, 10 ham
  * train twice to "10.db" and "10.pkl" -- the former a Berkeley DB
    file, the latter a pickle

I'm having at least two problems, both related to
score/unlearn/score/relearn/score on a single message.  Specifically:

  * the second score (post unlearning) is different for the two
    storages: the DB file scores my test message 1.000 after removing
    it from the training database, and the pickle file scores it 0.056.
    (The message in question is a spam that doesn't look like other
    spam, so the results with the pickle file make more sense to me.)

  * at the end of the full cycle, the pickle database is unchanged
    (verified by diff'ing the output of dbExpImp), but the DB storage
    is different.  Specifically, it appears that the spam count of
    every token in the unlearned/relearned message is incremented by
    one.
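
The invariant expected here can be sketched with a toy in-memory store.
This is not SpamBayes code: ToyCounts and roundtrip_is_noop() are
hypothetical names made up for illustration, and the toy tracks spam
counts only.

```python
class ToyCounts:
    """Hypothetical stand-in for a token-count store (not the SpamBayes classifier)."""

    def __init__(self):
        self.nspam = 0
        self.spamcount = {}  # token -> number of spam messages containing it

    def learn(self, tokens):
        self.nspam += 1
        for tok in set(tokens):
            self.spamcount[tok] = self.spamcount.get(tok, 0) + 1

    def unlearn(self, tokens):
        self.nspam -= 1
        for tok in set(tokens):
            remaining = self.spamcount[tok] - 1
            if remaining:
                self.spamcount[tok] = remaining
            else:
                del self.spamcount[tok]  # drop tokens that no longer occur

    def snapshot(self):
        # Copy the state so later mutation cannot change the snapshot.
        return (self.nspam, dict(self.spamcount))


def roundtrip_is_noop(store, tokens):
    """Unlearn then relearn one message; report whether the state is unchanged."""
    before = store.snapshot()
    store.unlearn(tokens)
    store.learn(tokens)
    return store.snapshot() == before
```

A correct storage should make roundtrip_is_noop() return true for any
message that was previously trained; the pickle storage behaves that way,
the DB storage does not.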

The final clue: this weird behaviour only happens with my 'simplescore'
script (attached; and yes, it is simple).  If I
score/unlearn/score/relearn/score with five distinct invocations of
hammie.py, things appear to work just fine, and I get identical results
with the pickle and DB storages.  I've scoured my simplescore script to
see if there's anything screwy there, but I sure can't see it.  Either
it needs a second pair of eyeballs, or there's something wrong with
untraining/retraining on a message in a DB storage within the same
process.
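
For what it's worth, both symptoms are what one would see if the unlearn
step were silently lost within the process: the post-unlearn score would
equal the initial one, and the relearn would then add one to every
token's spam count.  The toy below shows that pattern only.  It is an
assumption for illustration, not a diagnosis; LossyStore and the token
data are hypothetical and have nothing to do with the real storage code.

```python
class LossyStore:
    """Hypothetical store whose unlearn() edits a local copy and never writes it back."""

    def __init__(self, counts):
        self.db = dict(counts)  # token -> spam count; stands in for the on-disk file

    def learn(self, tokens):
        for tok in set(tokens):
            self.db[tok] = self.db.get(tok, 0) + 1

    def unlearn(self, tokens):
        for tok in set(tokens):
            count = self.db.get(tok, 0)
            count -= 1  # the decrement is never stored back: this is the bug


store = LossyStore({"viagra": 3, "peninsula": 1})
message = ["viagra", "peninsula"]

store.unlearn(message)
after_unlearn = dict(store.db)   # identical to the starting counts

store.learn(message)
after_relearn = dict(store.db)   # every token in the message is now +1
```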

So, if you have a minute, could you look over the attached simplescore
script and see if it looks sane to you, i.e. does it unlearn/relearn in
the correct way?  (Note that there are two implementations in there: one
that tokenizes the message only once and uses slightly under-the-hood
calls, and the other that uses top-level calls and tokenizes the message
many times.  I get identical results with both versions.)

And if you have ten minutes, could you download

  http://www.gerg.ca/spambayes-test-unlearn.tar.gz

and unpack it, and then try

  cd test
  python simplescore 10.db spam/cur/19S0vv-0003B0-00:2,S
  python simplescore 10.pkl spam/cur/19S0vv-0003B0-00:2,S

and see if *you* can figure out what the hell is going on.  Note that
'save/' contains copies of 10.db and 10.pkl as originally trained.
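
To check the second symptom against the copies in 'save/', one could
compare per-token counts before and after a run.  A minimal sketch,
assuming both databases have already been exported into plain dicts
mapping each token to a (hamcount, spamcount) pair; count_diffs() is a
hypothetical helper, not part of SpamBayes or dbExpImp.

```python
def count_diffs(before, after):
    """Return {token: (old_counts, new_counts)} for tokens whose counts differ.

    A token missing from one side is reported with None on that side.
    """
    diffs = {}
    for tok in set(before) | set(after):
        old = before.get(tok)
        new = after.get(tok)
        if old != new:
            diffs[tok] = (old, new)
    return diffs


# Made-up data: one token's spam count has crept up by one.
saved = {"free": (0, 4), "meeting": (3, 0)}
current = {"free": (0, 5), "meeting": (3, 0)}
changed = count_diffs(saved, current)
```

An empty result means the full cycle left the database untouched.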

        Greg
-- 
Greg Ward <gward at python.net>                         http://www.gerg.ca/
No man is an island, but some of us are long peninsulas.
-------------- next part --------------
#!/usr/bin/env python2.2

import sys
from spambayes import hammie
from spambayes import tokenizer

def score(h, tokens, label):
    # Print the spam probability and the last five clues for these tokens.
    # The Hammie object is named 'h' so it doesn't shadow the hammie module.
    (prob, clues) = h.bayes.spamprob(tokens, True)
    high_clues = ["%s:%.3f" % clue for clue in clues[-5:]]
    high_clues = ", ".join(high_clues)
    print "%s: %.3f: %s" % (label, prob, high_clues)

def allscores(h, message):
    # Tokenize message once
    bayes = h.bayes
    tokens = list(tokenizer.tokenize(message))

    # Score with that message (presumably) in the database.
    score(h, tokens, "initial")

    # Untrain (i.e. remove this message from the database) and score again
    # (assume the message is spam).
    bayes.unlearn(tokens, True)
    score(h, tokens, "unlearn")

    # Retrain and score one last time.  Should give identical results
    # to the initial scoring... but doesn't!
    bayes.learn(tokens, True)
    score(h, tokens, "relearn")

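# Alternative implementation: uses top-level hammie calls and so
# re-tokenizes the message on every call.  Gives identical results.
# def score(hammie, msg, label):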
#     (prob, clues) = hammie.score(msg, True)
#     high_clues = ["%s:%.3f" % clue for clue in clues[-5:]]
#     high_clues = ", ".join(high_clues)
#     print "%s: %.3f: %s" % (label, prob, high_clues)

# def allscores(hammie, message):
#     score(hammie, message, "initial")

#     hammie.untrain(message, True)
#     score(hammie, message, "unlearn")

#     hammie.train(message, True)
#     score(hammie, message, "relearn")

def main():
    args = sys.argv[1:]
    if len(args) != 2:
        sys.exit("usage: simplescore db_file msg_file")

    (db_filename, msg_filename) = args
    usedb = db_filename.endswith(".db")  # assume pickle otherwise
    h = hammie.open(db_filename, usedb=usedb, mode="w")

    # Read the whole message and close the file explicitly.
    f = open(msg_filename)
    message = f.read()
    f.close()
    allscores(h, message)

    # Write the database back out (it should be unchanged by now).
    h.bayes.store()

if __name__ == "__main__":
    main()