[spambayes-dev] Evaluating a training corpus

Tim Peters tim.one at comcast.net
Sun Jun 8 18:42:50 EDT 2003


[Meyer, Tony]
> Which is preferred, timtest or timcv?  The readme has:
>     [timcv] is the preferred way to test when possible:

So which part of "preferred" is unclear there <wink>?

> ...
> But also:
>     [timtest] is a much harder test than timcv, because it trains on
>     N-1 times less data, and makes each classifier predict against
>     N-1 times more data than it's been taught about.
> And I would have thought that a harder test was a better test.  (I
> presume that if I understood more statistics I could answer this
> myself...).

Carry it to an extreme:  train on 1 ham and 1 spam, then score a million
msgs against that 2-message database.  That's as hard as it gets, but
unlikely to be predictive of real-life usage.  If you have a thousand msgs
in your database, and score against 100 per day, then timcv is quite close
to real-life per-day usage.  If you have 100 msgs in your database, and
score against 1000 per day, then timtest is closer, but also harder to make
sense of since each msg is scored N-1 different times (by each of N-1
different classifiers).  timcv scores each msg exactly once, so is easier to
make sense of.  Pick your poison accordingly.
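
A minimal sketch of the bookkeeping behind that difference, assuming the
corpus is split into N equal sets.  This is not the actual timcv or timtest
code; the function names and N = 5 are made up for illustration.  It only
counts how many times each set gets scored under each scheme.

```python
def timcv_style_splits(n_sets):
    """Cross-validation style: train on N-1 sets, score the one held out."""
    for held_out in range(n_sets):
        train = [i for i in range(n_sets) if i != held_out]
        yield train, [held_out]


def timtest_style_splits(n_sets):
    """Grid style: train on a single set, score the other N-1 sets."""
    for trained in range(n_sets):
        test = [i for i in range(n_sets) if i != trained]
        yield [trained], test


def times_scored(splits, n_sets):
    """Count how often each set (hence each msg in it) is scored."""
    counts = [0] * n_sets
    for _train, test in splits:
        for i in test:
            counts[i] += 1
    return counts


N = 5  # hypothetical number of sets
cv_counts = times_scored(timcv_style_splits(N), N)
tt_counts = times_scored(timtest_style_splits(N), N)
print("timcv-style:  ", cv_counts)
print("timtest-style:", tt_counts)
```

With N = 5, every set is scored once in the timcv-style run and N-1 = 4
times in the timtest-style run, each time by a classifier trained on a
different single set.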



