[spambayes-dev] Bug in timcv.py

Kenny Pitt kennypitt at hotmail.com
Tue Dec 2 14:30:44 EST 2003


It looks like there is a bug in timcv.py.  I tried to run a test that
trains on only a small number of messages, and got the following
output.

"""
C:\src\python\spambayes_exp\testtools>python timcv.py -n 5 --HamTrain 10 --SpamTrain 10 --HamTest 150 --SpamTest 400  timcv_10-10.txt
Traceback (most recent call last):
  File "timcv.py", line 170, in ?
    main()
  File "timcv.py", line 167, in main
    drive(nsets)
  File "timcv.py", line 115, in drive
    d.test(hamstream, spamstream)
  File "C:\src\python\spambayes_exp\spambayes\TestDriver.py", line 266, in test
    t.predict(spam, True, new_spam)
  File "C:\src\python\spambayes_exp\spambayes\Tester.py", line 92, in predict
    prob = guess(example)
  File "C:\src\python\spambayes\spambayes\classifier.py", line 158, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "C:\src\python\spambayes\spambayes\classifier.py", line 395, in _getclues
    prob = self.probability(record)
  File "C:\src\python\spambayes\spambayes\classifier.py", line 242, in probability
    assert hamcount <= nham
AssertionError
"""

I took a quick look at timcv.py, and I think I know what is happening.
The ham and spam streams for initial training are created with
"train=1", but the untrain() for the set being tested uses streams
created with "train=0".  If the HamTrain/SpamTrain counts differ from
the HamTest/SpamTest counts, untrain() does not remove the same set of
messages that was trained, which would leave the classifier's counts
inconsistent and explain the failed assertion.  I can, of course, work
around this by setting build_each_classifier_from_scratch, but I wanted
to let everyone know about the mismatch.
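
To illustrate the kind of inconsistency I mean, here is a minimal sketch.
It does not use the real SpamBayes classifier: ToyClassifier, learn(),
unlearn() and consistent() are hypothetical names made up for this
example, and only the ham side is modelled.  It assumes the classifier
keeps a message count (nham) and a per-token count (hamcount), which is
what the failing assertion compares.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a token-counting classifier (ham only)."""

    def __init__(self):
        self.nham = 0              # number of ham messages trained
        self.hamcount = Counter()  # number of ham messages containing each token

    def learn(self, tokens):
        self.nham += 1
        for tok in set(tokens):
            self.hamcount[tok] += 1

    def unlearn(self, tokens):
        # Always decrements the message count, even if these tokens
        # were never trained in the first place.
        self.nham -= 1
        for tok in set(tokens):
            if self.hamcount[tok] > 0:
                self.hamcount[tok] -= 1

    def consistent(self):
        # Same invariant as the assertion in the traceback:
        # no token may appear in more ham messages than were trained.
        return all(count <= self.nham for count in self.hamcount.values())


trained = [["cheap", "meeting"], ["meeting", "lunch"]]
different = [["report"], ["agenda"]]

# Untraining exactly what was trained keeps the counts consistent.
matched = ToyClassifier()
for msg in trained:
    matched.learn(msg)
for msg in trained:
    matched.unlearn(msg)
matched_ok = matched.consistent()

# Untraining a different set of messages breaks the invariant.
mismatched = ToyClassifier()
for msg in trained:
    mismatched.learn(msg)
for msg in different:
    mismatched.unlearn(msg)
mismatched_ok = mismatched.consistent()

print(matched_ok, mismatched_ok)
```

In the mismatched case the message count drops back to zero while the
token counts for the trained messages are untouched, so a check like
"hamcount <= nham" fails the next time one of those tokens is scored.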

I noticed another curiosity in the traceback:  I ran the test from
inside directory "C:\src\python\spambayes_exp", which contains my
modified version of SpamBayes.  When the traceback gets to
classifier.py, however, you can see that classifier.py was loaded from
"C:\src\python\spambayes" instead, which is where I keep my original CVS
version of SpamBayes.  I don't have a PYTHONPATH environment variable
set, and I don't know what else might cause it to jump paths like that.
Can one of you more experienced Python users explain this?
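
For anyone who wants to check the same thing on their own setup, here is
a small diagnostic sketch.  It assumes a current Python 3 interpreter
(importlib is not available in older versions), and where_is() is a
helper name invented for this example.  It only shows where the
interpreter looks and which file a module was actually loaded from; it
does not tell you why a given directory ended up on the search path.
The standard-library module "json" is used as a stand-in; substitute the
module you are curious about.

```python
import importlib
import sys


def where_is(module_name):
    """Import a module by name and return the file it was loaded from."""
    module = importlib.import_module(module_name)
    # Built-in modules have no __file__, so fall back to None.
    return getattr(module, "__file__", None)


# The interpreter searches these directories in order.  When a script is
# run as "python script.py", the first entry is the directory containing
# the script, not the directory the command was typed in.
for entry in sys.path:
    print(entry)

# "json" is only an example module name.
location = where_is("json")
print(location)
```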

-- 
Kenny Pitt



