[spambayes-dev] Evaluating a training corpus

Greg Ward gward at python.net
Sun Jun 8 17:47:52 EDT 2003


I'm mulling ways to evaluate the quality of a training corpus, and was
wondering what the rest of you have tried.  My current technique is
pretty bogus: train on the complete corpus, and then score every message
in the corpus using the resulting database.  Obviously this is a
self-fulfilling prophecy, but at least it highlights spam that is
*really* different from other spam (and ditto for ham).

I know there's code lurking in there somewhere (timcv.py?) for training
on 90% of the corpus, and then evaluating the other 10% under the
resulting database.  That got me to thinking: why not build a complete
training database, and then do this:

  foreach message:
      remove message from database (ie. untrain)
      score message
      report score
      put message back in database

That removes the "self-fulfilling prophecy" bit, and the arbitrary
nature of the 90%/10% split.  But it should preserve the property of
highlighting unusual spam or ham.  Seems to me like this should do a
pretty good job of finding misclassified messages, at least.
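(That loop is essentially leave-one-out cross-validation, with the
untrain/retrain trick saving you from rebuilding the database N times.
Here's a minimal sketch of the idea, using a toy word-count classifier
rather than the real spambayes database -- the class and method names
here are purely illustrative, though the real Classifier supports
incremental learn/unlearn in much the same spirit:)

```python
from collections import Counter

class ToyClassifier:
    """Stand-in for the training database: supports incremental
    train ("learn") and untrain ("unlearn")."""
    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def learn(self, words, is_spam):
        (self.spam if is_spam else self.ham).update(words)

    def unlearn(self, words, is_spam):
        (self.spam if is_spam else self.ham).subtract(words)

    def score(self, words):
        # Crude spamminess: per-word spam fraction, averaged over
        # words we've actually seen; 0.5 if we know nothing.
        probs = []
        for w in words:
            s, h = max(self.spam[w], 0), max(self.ham[w], 0)
            if s + h:
                probs.append(s / (s + h))
        return sum(probs) / len(probs) if probs else 0.5

def leave_one_out(corpus, clf):
    """corpus is a list of (words, is_spam) pairs, all already
    trained into clf.  Yields (index, score) for each message,
    scored against a database that excludes that message."""
    for i, (words, is_spam) in enumerate(corpus):
        clf.unlearn(words, is_spam)      # remove message from database
        yield i, clf.score(words)        # score it against the rest
        clf.learn(words, is_spam)        # put it back

corpus = [
    (["free", "money", "now"], True),
    (["cheap", "money", "offer"], True),
    (["meeting", "tomorrow", "agenda"], False),
    (["lunch", "meeting", "friday"], False),
]
clf = ToyClassifier()
for words, is_spam in corpus:
    clf.learn(words, is_spam)

for i, score in leave_one_out(corpus, clf):
    print(i, round(score, 2))
```

Sorting the reported scores by "how wrong" they are relative to each
message's label would then surface the misclassified or unusual
messages first.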

Has anyone else tried something like this?  Is there code out there
already?

        Greg
-- 
Greg Ward <gward at python.net>                         http://www.gerg.ca/
OUR PLAN HAS FAILED STOP JOHN DENVER IS NOT TRULY DEAD
STOP HE LIVES ON IN HIS MUSIC STOP PLEASE ADVISE FULL STOP


