[spambayes-dev] Bug in timcv.py

Tue Dec 2 22:59:52 EST 2003

[Kenny Pitt]
> It looks like there is a bug in timcv.  I tried to run a test of
> training on only a small number of messages, and I got the following
> output.
>
> """
> C:\src\python\spambayes_exp\testtools>python timcv.py -n 5 --HamTrain
> 10 --SpamTrain 10 --HamTest 150 --SpamTest 400  timcv_10-10.txt

Wow -- I didn't even know those options ({Ham,Spam}{Train,Test}) existed.
They warp the meaning of "cross validation" beyond my recognition, so I wish
they had been added to a new "cv-like" test driver instead.  Oh well.

> Traceback (most recent call last):
>    ...
>   File "C:\src\python\spambayes_exp\spambayes\TestDriver.py", line 266, in test
>     t.predict(spam, True, new_spam)
> ...
>   File "C:\src\python\spambayes\spambayes\classifier.py", line 242, in probability
>     assert hamcount <= nham
> AssertionError
> """

Ouch.

> I took a quick look at timcv.py, and I think I know what is happening.
> The ham and spam streams for initial training are created with
> "train=1",

Right.

> but the untrain() for the set being tested is done using streams that
> are created with "train=0".

Right.

> If the HamTrain/SpamTrain counts are different from the
> HamTest/SpamTest counts then the untrain() does not use the same
> set of messages.

This isn't cross-validation testing, so the optimizations in timcv.py *for*
true cv testing stopped making sense when these other options were added.
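
To make the failure mode concrete, here is a minimal sketch of why untraining
messages that were never trained trips an assertion like "hamcount <= nham".
ToyCounts and demo() are hypothetical names invented for illustration; this is
not the SpamBayes classifier, only the same kind of bookkeeping reduced to a
toy.

```python
class ToyCounts:
    """Toy stand-in for a classifier's ham bookkeeping (not SpamBayes code)."""

    def __init__(self):
        self.nham = 0        # number of ham messages currently trained
        self.hamcount = {}   # word -> number of trained ham containing it

    def learn(self, words):
        self.nham += 1
        for w in set(words):
            self.hamcount[w] = self.hamcount.get(w, 0) + 1

    def unlearn(self, words):
        # Blindly assumes `words` came from a message learned earlier.
        self.nham -= 1
        for w in set(words):
            if w in self.hamcount:
                self.hamcount[w] -= 1

    def invariant_holds(self):
        # The condition behind an assertion like "hamcount <= nham".
        return all(0 <= n <= self.nham for n in self.hamcount.values())


def demo(untrain_same_message):
    c = ToyCounts()
    trained = [["python", "list"], ["python", "bug"]]
    for msg in trained:
        c.learn(msg)
    # Remove either a message that really was trained, or a different one.
    victim = trained[0] if untrain_same_message else ["cheap", "offer"]
    c.unlearn(victim)
    return c.invariant_holds()
```

Calling demo(True) leaves the counts consistent. Calling demo(False) lowers
nham without lowering any per-word count, so "python" ends up counted in more
ham than the toy thinks it has trained on.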

> I can, of course, work around this by setting
> build_each_classifier_from_scratch, but just wanted to let everyone
> know about the mismatch.

I'd rather see these options moved into a different test driver, leaving
timcv.py unsurprising again.  Since timcv.py is the primary driver for
serious testing, it should be kept as simple and bulletproof as possible.  I
regret that the build_each_classifier_from_scratch option was added to it
for the same reason (as the comments for that option say, there was a need
for that option at one time, when evaluating some since-rejected combining
schemes where *incremental* training and untraining were impossible; those
schemes went away, but the option stayed behind to muddy the waters).

> I noticed another curiosity in the traceback:  I ran the test from
> inside directory "C:\src\python\spambayes_exp", which contains my
> modified version of SpamBayes.  When the traceback gets to
> classifier.py, however, you can see that classifier.py was loaded from
> "C:\src\python\spambayes" instead, which is where I have my original
> CVS version of SpamBayes.  I don't have any PYTHONPATH environment
> variable set, and I don't know what else might cause it to jump paths
> like that. Can one of you more experienced python'ers explain this?

Run Python with -v to get a report of how every import got satisfied.  Then
stare until your eyes bleed <0.9 wink>.  I notice that a lot of the scripts
these days muck around with sys.path directly, thus changing Python's search
path dynamically, at runtime.  That's *usually* a Bad Idea.  If I were you,
I'd take a critical look at the fix_sys_path() function in
sb_test_support.py.  I don't know how this got so convoluted, but gobs of
dynamic code trying to "fix" what should be statically known (or at worst
fiddled once in a config file) is a pretty sure recipe for confusion.
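
As a lighter-weight complement to -v, the sketch below asks Python where it
would load a module from and where an already-imported module actually came
from. It assumes a modern Python with importlib.util; module_origin() and
loaded_from() are hypothetical helper names, not part of SpamBayes.

```python
import importlib.util
import sys


def module_origin(name):
    """Return the location Python would load top-level module `name` from.

    Returns None when no module of that name is found on the current
    search path.
    """
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def loaded_from(name):
    """Return the file an already-imported module came from, if known."""
    module = sys.modules.get(name)
    return getattr(module, "__file__", None)
```

For instance, printing module_origin("classifier") just before and just after
any code that modifies sys.path would show whether that code changed which
copy wins.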