[spambayes-dev] Evaluating a training corpus
Meyer, Tony
T.A.Meyer at massey.ac.nz
Mon Jun 9 10:26:40 EDT 2003
> mboxtest.py is probably the easiest to get going. I think
> timcv.py gives better results but it's a little more trouble
> to setup your test data. See README.txt for a short
> explanation of the tools. If you want to use timcv.py, you
> can use splitndirs.py to create the test data.
Which is preferred, timtest or timcv? The readme has:
[timcv] is the preferred way to test when possible: it
makes best use of limited data, and interpreting results is
straightforward.
But also:
[timtest] is a much harder test than timcv, because it trains on
N-1 times less data, and makes each classifier predict against
N-1 times more data than it's been taught about.
And I would have thought that a harder test was a better test. (I
presume that if I understood more statistics, I could answer this
myself.)
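
To make the two README descriptions concrete, here is a rough sketch of
how I read them. It is not the actual timcv.py or timtest.py code: the
function names cv_splits and grid_splits are made up for illustration,
and the message groups are just integer indices. It assumes timcv trains
on N-1 groups and predicts against the remaining one, while timtest
trains on a single group and predicts against the other N-1.

```python
def cv_splits(n):
    """Cross-validation style (my reading of timcv): for each group,
    train on the other n-1 groups and predict against the one held out."""
    groups = list(range(n))
    return [([g for g in groups if g != held_out], [held_out])
            for held_out in groups]


def grid_splits(n):
    """Grid style (my reading of timtest): for each group, train on that
    group alone and predict against the other n-1 groups."""
    groups = list(range(n))
    return [([train], [g for g in groups if g != train])
            for train in groups]


# Hypothetical example: 5 groups of messages, numbered 0..4.
for name, splits in (("cv", cv_splits(5)), ("grid", grid_splits(5))):
    for train, test in splits:
        print(name, "train on", train, "predict against", test)
```

With 5 groups, each cv classifier is trained on 4 groups and scored on
1, while each grid classifier is trained on 1 group and scored on 4.
That is the "N-1 times less data" the README mentions.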
=Tony Meyer