[spambayes-dev] Bug in timcv.py
Kenny Pitt
kennypitt at hotmail.com
Tue Dec 2 14:30:44 EST 2003
It looks like there is a bug in timcv. I tried to run a test of
training on only a small number of messages, and I got the following
output.
"""
C:\src\python\spambayes_exp\testtools>python timcv.py -n 5 --HamTrain 10
--SpamTrain 10 --HamTest 150 --SpamTest 400 timcv_10-10.txt
Traceback (most recent call last):
  File "timcv.py", line 170, in ?
    main()
  File "timcv.py", line 167, in main
    drive(nsets)
  File "timcv.py", line 115, in drive
    d.test(hamstream, spamstream)
  File "C:\src\python\spambayes_exp\spambayes\TestDriver.py", line 266, in test
    t.predict(spam, True, new_spam)
  File "C:\src\python\spambayes_exp\spambayes\Tester.py", line 92, in predict
    prob = guess(example)
  File "C:\src\python\spambayes\spambayes\classifier.py", line 158, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "C:\src\python\spambayes\spambayes\classifier.py", line 395, in _getclues
    prob = self.probability(record)
  File "C:\src\python\spambayes\spambayes\classifier.py", line 242, in probability
    assert hamcount <= nham
AssertionError
"""
I took a quick look at timcv.py, and I think I know what is happening.
The ham and spam streams for initial training are created with
"train=1", but the untrain() for the set being tested is done using
streams that are created with "train=0". If the HamTrain/SpamTrain
counts differ from the HamTest/SpamTest counts, then the untrain()
does not use the same set of messages that were trained on, which would
leave the per-word counts out of step with nham and could explain the
failed assertion. I can, of course, work around this by setting
build_each_classifier_from_scratch, but I just wanted to let everyone
know about the mismatch.
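
To illustrate the suspected mismatch, here is a minimal sketch using a toy classifier. ToyClassifier, stream() and run() are hypothetical names made up for this example and are not SpamBayes code. The sketch also assumes a stream is simply the first N messages of a set, which may not match how timcv.py actually selects messages. It only shows how untraining with a different message count than was trained can break the "hamcount <= nham" invariant.

```python
class ToyClassifier:
    """Hypothetical stand-in for the real classifier; tracks ham counts only."""

    def __init__(self):
        self.nham = 0          # number of ham messages trained
        self.hamcount = {}     # word -> number of ham messages containing it

    def learn(self, words):
        self.nham += 1
        for w in set(words):
            self.hamcount[w] = self.hamcount.get(w, 0) + 1

    def unlearn(self, words):
        self.nham -= 1
        for w in set(words):
            # Assumption: silently skip words that were never trained.
            if self.hamcount.get(w, 0) > 0:
                self.hamcount[w] -= 1

    def consistent(self):
        # The invariant the failing assertion checks, applied to every word.
        return all(c <= self.nham for c in self.hamcount.values())


def stream(messages, limit):
    """Assumed behaviour: a stream is the first `limit` messages of a set."""
    return messages[:limit]


def run(train_limit, untrain_limit):
    set1 = [["alpha"], ["alpha"], ["beta"], ["beta"]]
    set2 = [["gamma"], ["gamma"], ["gamma"], ["gamma"]]
    c = ToyClassifier()
    # Initial training over all sets, using the training limit.
    for msgs in (set1, set2):
        for m in stream(msgs, train_limit):
            c.learn(m)
    # Untrain the set under test, using a possibly different limit.
    for m in stream(set1, untrain_limit):
        c.unlearn(m)
    return c.consistent()


print(run(2, 2))  # limits match
print(run(2, 4))  # untrain stream is larger than the train stream
```

With matching limits the counts stay consistent. With a larger untrain stream, nham drops by more messages than were ever trained from that set, while the counts for words from the other set stay put.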
I noticed another curiosity in the traceback: I ran the test from
inside directory "C:\src\python\spambayes_exp", which contains my
modified version of SpamBayes. When the traceback gets to
classifier.py, however, you can see that classifier.py was loaded from
"C:\src\python\spambayes" instead, which is where I have my original CVS
version of SpamBayes. I don't have any PYTHONPATH environment variable
set, and I don't know what else might cause it to jump paths like that.
Can one of you more experienced Python users explain this?
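
One way to narrow this down would be to ask Python where it would load a package from and to print the search path in order. The sketch below is only a diagnostic aid, not an explanation of the behaviour above. module_origin() is a hypothetical helper name, and "json" is used as a stand-in so the snippet runs anywhere; the name of interest here would be "spambayes". It uses importlib, so it assumes a modern Python rather than the interpreter from the traceback.

```python
import importlib.util
import sys


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# "json" is a stand-in; substitute the package being investigated.
print(module_origin("json"))

# sys.path is searched in order, so the first entry containing the
# package is the one that gets imported.
for entry in sys.path:
    print(repr(entry))
```

If the printed origin points at the unexpected directory, the sys.path listing should show which entry precedes the intended one. Entries can come from sources other than PYTHONPATH, such as .pth files in site-packages.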
--
Kenny Pitt