[spambayes-dev] Evaluating a training corpus

Tim Peters tim.one at comcast.net
Sun Jun 8 19:00:17 EDT 2003


[Greg Ward]
> I'm mulling ways to evaluate the quality of a training corpus, and was
> wondering what the rest of you have tried.  My current technique is
> pretty bogus: train on the complete corpus, and then score every
> message in the corpus using the resulting database.  Obviously this
> is a self-fulfilling prophecy, but at least it highlights spam that
> is *really* different from other spam (and ditto for ham).

What is the purpose of testing for you?  A useful answer will contain at
least one number <wink>.

> I know there's code lurking in there somewhere (timcv.py?) for
> training on 90% of the corpus, and then evaluating the other 10%
> under the resulting database.

timcv does that 10 times (or N times, for whatever N you choose), training
on (N-1)/N of the messages and scoring the remaining 1/N of them.

> That got me to thinking: why not build a complete training database,
> and then do this:
>
>   foreach message:
>       remove message from database (ie. untrain)
>       score message
>       report score
>       put message back in database

That's what timcv does if you set N equal to the number of messages (M) in
the database.  In outline:

    partition the msgs into N groups, each with about M/N msgs
    foreach group:
        remove group from database
        score group
        report scores
        put group back in database
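
For concreteness, here's a minimal runnable sketch of that loop in
Python.  It assumes a spambayes-style classifier with learn / unlearn /
spamprob methods (those are the names on spambayes's Classifier), and it
represents each msg as a (tokens, is_spam) pair; tokenizing real
messages is elided:

    import random

    def cross_validate(classifier, corpus, n_groups):
        # corpus is a list of (tokens, is_spam) pairs; setting n_groups
        # equal to len(corpus) gives the leave-one-out scheme above.
        msgs = list(corpus)
        random.shuffle(msgs)
        groups = [msgs[i::n_groups] for i in range(n_groups)]
        for group in groups:
            # remove group from database
            for tokens, is_spam in group:
                classifier.unlearn(tokens, is_spam)
            # score group and report scores
            for tokens, is_spam in group:
                print(classifier.spamprob(tokens), is_spam)
            # put group back in database
            for tokens, is_spam in group:
                classifier.learn(tokens, is_spam)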

> That removes the "self-fulfilling prophecy" bit, and the arbitrary
> nature of the 10%/90% selection.

M seems as arbitrary as 10 to me <wink>.

> But it should preserve the property of highlighting unusual spam or
> ham.  Seems to me like this should do a pretty good job of finding
> misclassified messages, at least.

I think most people have found that breaking the msgs into 10 groups does an
excellent job of finding misclassified msgs already.  When we were running
python.org tests, that's where my reports of misclassified msgs came from!
BTW, after getting the misclassified msgs into the right classes, it's not
unusual to find more misclassified msgs by running it again.  Sometimes this
goes on for several iterations.
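
If you want to automate that cleanup loop, a rough sketch (run_cv and
review are hypothetical hooks here: run_cv returns the msgs that scored
on the wrong side, and review is the human step that confirms and flips
genuinely mislabeled msgs):

    def scrub_corpus(corpus, run_cv, review):
        # Iterate to a fixed point: rerun cross validation until a
        # review pass turns up no more mislabeled msgs.
        while True:
            suspects = run_cv(corpus)
            relabeled = review(suspects)
            if not relabeled:
                return corpus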

> Has anyone else tried something like this?  Is there code out there
> already?

timcv is a general "cross validation" driver (that's what "cv" stands for).
It's a standard statistical testing technique, and you can make N as large
(or small) as you like.  For purposes of predicting real-life behavior, pick
N so that (N-1)/N * M is about equal to the number of msgs you expect to
train on.  timcv then builds and tests N classifiers of that size, testing
each against the M/N withheld from it.
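
A quick worked example, with a hypothetical helper: if you have M = 3000
msgs on hand and expect to train on about 2700, then N = M / (M - 2700)
= 10, i.e. ordinary 10-fold cross validation.

    def pick_n(m, expected_training_size):
        # Solve (N-1)/N * M ~= expected_training_size for N:
        #     N ~= M / (M - expected_training_size)
        # Requires expected_training_size < m.
        return max(2, round(m / (m - expected_training_size)))

    # pick_n(3000, 2700) -> 10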



