[Spambayes] how spambayes handles image-only spams
Tim Peters
tim.one at comcast.net
Thu Sep 11 13:41:45 EDT 2003
[Bill Yerazunis]
>> What's your test protocol? I did "shuffle messages randomly,
>> but preserve knowledge of which class they were in, then
>> train with the first 90% and then test with the last 10%".
>> Repeat as needed...
[Tony Meyer]
> I did "rebal.py -n5", which IIRC is roughly equivalent to "shuffle
> messages randomly, but preserve knowledge of which class they were
> in". I then did "timtest.py -n5".
>
> I'm happy to admit I understand little of what the testing code does,
> just how to interpret (most of) the results that it gives me. This is
> one of the strengths of the spambayes testing suite, IMO (not that I
> have tried any other testing suites).
>
> The readme says that it does this:
> """
> Runs an NxN test grid, skipping the diagonal:
> N classifiers are built.
> N-1 runs are done with each classifier.
> Each classifier is trained on 1 set, and predicts against each of
> the N-1 remaining sets (those not used to train the
> classifier).
> """
>
> So in my case, I think this means that I train with the first 20%,
> then test with each of the remaining 20%s (and repeat). I may be
> wrong <wink>.
That's a good description of what timtest.py does. It's not the preferred
way to test, because the results are hard to interpret, it's slow (N**2-N
test runs are made), and it's brutal (in your case, using -n5, each
classifier built is tested against 4x as many messages as it was trained on).
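
The grid described in the readme excerpt can be sketched roughly as
follows. This is only an illustration of the run counts, not the actual
timtest.py code; the name grid_runs is made up for the example, and the
"4x" figure assumes the N sets are the same size.

```python
from itertools import permutations

def grid_runs(n):
    """Hypothetical helper: list (train_set, test_set) index pairs for an
    NxN grid with the diagonal skipped (a classifier is never tested on
    the set it was trained on)."""
    return list(permutations(range(n), 2))

n = 5
runs = grid_runs(n)

# Total number of test runs: N**2 - N.
total_runs = len(runs)

# Each classifier is trained on 1 set and tested against the other N-1,
# so with equal-sized sets it sees (N-1)x as many test messages as
# training messages.
tests_per_classifier = sum(1 for train, _ in runs if train == 0)

print(total_runs, tests_per_classifier)
```

With n = 5 this gives 20 test runs, and each classifier is tested on 4
sets after training on 1.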
timcv.py is a more traditional cross-validation test driver, probably much
closer to what Bill is doing. Its results are easier to interpret, it runs
faster, and it will almost always deliver "better-looking results" than
timtest.py delivers, because the cross-validation driver trains on many more
messages than it tries to classify (the opposite is true of timtest.py).
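
For comparison, a generic N-fold cross-validation schedule looks roughly
like this. It is a sketch of the textbook technique, not of how timcv.py
is implemented internally; the name cv_runs is made up for the example,
and equal-sized sets are assumed.

```python
def cv_runs(n):
    """Hypothetical helper: for each of the n sets, train on the other
    n-1 sets and test on the single held-out set."""
    return [
        ([s for s in range(n) if s != held_out], held_out)
        for held_out in range(n)
    ]

n = 5
runs = cv_runs(n)

# Only N test runs are made, instead of N**2 - N for the grid.
total_runs = len(runs)

# Each classifier trains on N-1 sets and is tested on 1, so with
# equal-sized sets it trains on (N-1)x as many messages as it classifies.
train_sets_per_run = len(runs[0][0])

print(total_runs, train_sets_per_run)
```

With n = 5 that is 5 runs, each training on 4 sets and testing on 1,
which is the reverse of the grid's 1-to-4 ratio.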