[spambayes-dev] Bug in timcv.py
Tim Peters
tim.one at comcast.net
Tue Dec 2 22:59:52 EST 2003
[Kenny Pitt]
> It looks like there is a bug in timcv. I tried to run a test of
> training on only a small number of messages, and I got the following
> output.
>
> """
> C:\src\python\spambayes_exp\testtools>python timcv.py -n 5 --HamTrain
> 10 --SpamTrain 10 --HamTest 150 --SpamTest 400 timcv_10-10.txt
Wow -- I didn't even know those options ({Ham,Spam}{Train,Test}) existed.
They warp the meaning of "cross validation" beyond my recognition, so I wish
they had been added to a new "cv-like" test driver instead. Oh well.
> Traceback (most recent call last):
> ...
> File "C:\src\python\spambayes_exp\spambayes\TestDriver.py", line 266, in test
> t.predict(spam, True, new_spam)
> ...
> File "C:\src\python\spambayes\spambayes\classifier.py", line 242, in probability
> assert hamcount <= nham
> AssertionError
> """
Ouch.
> I took a quick look at timcv.py, and I think I know what is happening.
> The ham and spam streams for initial training are created with
> "train=1",
Right.
> but the untrain() for the set being tested is done using streams that
> are created with "train=0".
Right.
> If the HamTrain/SpamTrain counts are different from the
> HamTest/SpamTest counts then the untrain() does not use the same
> set of messages.
This isn't cross-validation testing, so the optimizations in timcv.py *for*
true cv testing stopped making sense when these other options were added.
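
To make the failure mode concrete, here is a minimal sketch of why untraining on a different set of messages than was trained on can trip an invariant like "hamcount <= nham". ToyClassifier, its methods and the sample token lists are all hypothetical stand-ins invented for illustration. This is not the SpamBayes classifier or timcv.py code, only a toy with the same kind of per-token bookkeeping.

```python
class ToyClassifier:
    """Toy token counter (hypothetical, not SpamBayes code)."""

    def __init__(self):
        self.nham = 0        # number of ham messages currently trained
        self.hamcount = {}   # token -> number of trained ham messages containing it

    def learn(self, tokens):
        self.nham += 1
        for tok in set(tokens):
            self.hamcount[tok] = self.hamcount.get(tok, 0) + 1

    def unlearn(self, tokens):
        self.nham -= 1
        for tok in set(tokens):
            self.hamcount[tok] = self.hamcount.get(tok, 0) - 1

    def consistent(self):
        # Analogue of the failing assertion: no token may appear in
        # more ham messages than the classifier believes it has seen.
        return all(0 <= count <= self.nham for count in self.hamcount.values())


trained = [["free", "money"], ["free", "offer"], ["free", "lunch"]]
never_trained = [["hello", "world"], ["status", "report"]]

# Untraining a message that really was trained keeps the counts sane.
matched = ToyClassifier()
for msg in trained:
    matched.learn(msg)
matched.unlearn(trained[0])

# Untraining messages that were never trained lowers nham but leaves
# the counts for the trained tokens untouched.
mismatched = ToyClassifier()
for msg in trained:
    mismatched.learn(msg)
for msg in never_trained:
    mismatched.unlearn(msg)

print("matched consistent:", matched.consistent())
print("mismatched consistent:", mismatched.consistent())
```

In the mismatched case the message total shrinks while the count for "free" stays at 3, which is the same shape of inconsistency the assertion in the traceback guards against.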
> I can, of course, work around this by setting
> build_each_classifier_from_scratch, but just wanted to let everyone
> know about the mismatch.
I'd rather see these options moved into a different test driver, leaving
timcv.py unsurprising again. Since timcv.py is the primary driver for
serious testing, it should be kept as simple and bulletproof as possible. I
regret that the build_each_classifier_from_scratch option was added to it
for the same reason (as the comments for that option say, there was a need
for that option at one time, when evaluating some since-rejected combining
schemes where *incremental* training and untraining were impossible; those
schemes went away, but the option stayed behind to muddy the waters).
> I noticed another curiosity in the traceback: I ran the test from
> inside directory "C:\src\python\spambayes_exp", which contains my
> modified version of SpamBayes. When the traceback gets to
> classifier.py, however, you can see that classifier.py was loaded from
> "C:\src\python\spambayes" instead, which is where I have my original
> CVS version of SpamBayes. I don't have any PYTHONPATH environment
> variable set, and I don't know what else might cause it to jump paths
> like that. Can one of you more experienced python'ers explain this?
Run Python with -v to get a report of how every import got satisfied. Then
stare until your eyes bleed <0.9 wink>. I notice that a lot of the scripts
these days muck around with sys.path directly, thus changing Python's search
path dynamically, at runtime. That's *usually* a Bad Idea. If I were you,
I'd take a critical look at the fix_sys_path() function in
sb_test_support.py. I don't know how this got so convoluted, but gobs of
dynamic code trying to "fix" what should be statically known (or at worst
fiddled once in a config file) is a pretty sure recipe for confusion.
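
As a lighter-weight complement to reading the -v output, a short script can report which file an import actually resolved to and what the search path looked like at that moment. This is only a sketch: where_loaded is a hypothetical helper name, and the stdlib json module is used as a stand-in so the snippet runs anywhere. To chase the problem described above you would presumably pass "spambayes.classifier" instead, and run it after whatever sys.path changes the test scripts make.

```python
import importlib
import sys


def where_loaded(module_name):
    """Import module_name and return the path of the file it came from.

    Hypothetical helper for illustration. Returns None for modules
    that have no __file__ attribute (for example, built-in modules).
    """
    module = importlib.import_module(module_name)
    return getattr(module, "__file__", None)


# "json" is a stand-in; substitute the module whose origin is in doubt.
print("loaded from:", where_loaded("json"))

# Entries are searched in order, so an earlier directory holding a
# module of the same name wins over a later one.
for index, entry in enumerate(sys.path):
    print(index, repr(entry))
```

If the reported file lives under a directory you did not expect, the position of that directory in the printed sys.path usually shows why it won.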