[spambayes-dev] Evaluating a training corpus
Greg Ward
gward at python.net
Sun Jun 8 17:47:52 EDT 2003
I'm mulling ways to evaluate the quality of a training corpus, and was
wondering what the rest of you have tried. My current technique is
pretty bogus: train on the complete corpus, and then score every message
in the corpus using the resulting database. Obviously this is a
self-fulfilling prophecy, but at least it highlights spam that is
*really* different from other spam (and ditto for ham).
I know there's code lurking in there somewhere (timcv.py?) for training
on 90% of the corpus, and then evaluating the other 10% under the
resulting database. That got me to thinking: why not build a complete
training database, and then do this:
    foreach message:
        remove message from database (i.e. untrain)
        score message
        report score
        put message back in database
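The loop above amounts to leave-one-out scoring against a mutable
training database. Here's a minimal sketch in Python; the classifier
is a toy token-counting stand-in (the real spambayes trainer has a
different interface), so only the untrain/score/retrain shape of the
loop is the point:

```python
from collections import Counter

class TinyClassifier:
    """Toy stand-in for the training database (hypothetical interface:
    the real spambayes Bayes/Hammie objects differ)."""
    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.nspam = self.nham = 0

    def train(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(tokens)
        if is_spam: self.nspam += 1
        else:       self.nham += 1

    def untrain(self, tokens, is_spam):
        # inverse of train(): decrement the same counters
        (self.spam if is_spam else self.ham).subtract(tokens)
        if is_spam: self.nspam -= 1
        else:       self.nham -= 1

    def score(self, tokens):
        # crude spam probability: fraction of token hits seen in spam
        s = sum(self.spam[t] for t in tokens)
        h = sum(self.ham[t] for t in tokens)
        return s / (s + h) if (s + h) else 0.5

def leave_one_out(clf, messages):
    """Score each message with itself temporarily untrained.
    messages is a list of (tokens, is_spam) pairs already trained
    into clf; the database is restored after each score."""
    results = []
    for tokens, is_spam in messages:
        clf.untrain(tokens, is_spam)      # remove message from database
        results.append((clf.score(tokens), is_spam))
        clf.train(tokens, is_spam)        # put message back
    return results

if __name__ == "__main__":
    corpus = [(["cheap", "pills"], True),
              (["cheap", "meds"],  True),
              (["meeting", "notes"], False),
              (["lunch", "notes"],   False)]
    clf = TinyClassifier()
    for tokens, is_spam in corpus:       # build the full database first
        clf.train(tokens, is_spam)
    for score, is_spam in leave_one_out(clf, corpus):
        print("%.2f  %s" % (score, "spam" if is_spam else "ham"))
```

A message whose leave-one-out score lands on the wrong side of the
threshold is exactly the "unusual spam or ham" (or misclassified
training data) the technique is meant to surface.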
That removes the "self-fulfilling prophecy" bit, and the arbitrary
nature of the 90%/10% split. But it should preserve the property of
highlighting unusual spam or ham. Seems to me like this should do a
pretty good job of finding misclassified messages, at least.
Has anyone else tried something like this? Is there code out there
already?
Greg
--
Greg Ward <gward at python.net> http://www.gerg.ca/
OUR PLAN HAS FAILED STOP JOHN DENVER IS NOT TRULY DEAD
STOP HE LIVES ON IN HIS MUSIC STOP PLEASE ADVISE FULL STOP