[Spambayes] Re: CRM114 in November breaks 99.9%. :-)
Robert Woodhead
trebor at animeigo.com
Tue Dec 3 14:28:10 2002
>However, if this is the approach Bill uses, you can't use it for
>performance estimates. Our speech and natural language group is
>very careful not to mix its training set with its test set. When
>they do, they do something like 10 fold cross validation which
>averages (?) the results of 10 experiments that take some random
>fraction of the data as training and the rest as testing.
Ah, but the point is, since each individual user will have his own
email stream to train on, all you care about is how accurate the
system is when it looks at the very next email that comes in. Thus,
a system that gets very good after a few weeks of training on all the
incoming mail, AND STAYS THAT WAY, is what you want in the real world.
Dividing up training sets can be good for analysing the statistical
properties of particular algorithm choices, but what counts (in a
production environment) is real world performance, and real world
filters have to adapt as the spam (and ham) changes over time.
Tests like "pick a random sample, train on it, and then pick another
sample (nonintersecting) from the same corpus, and test" don't
properly reflect the real world environment. Spams are ordered by
time!
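To make the contrast concrete, here is a minimal sketch of the two ways of splitting a corpus. The (timestamp, text, label) tuple layout and both function names are assumptions made for illustration; they are not taken from Spambayes or CRM114.

```python
import random

def chronological_split(messages, train_fraction=0.5):
    """Train on the oldest messages, test on the newest ones.

    Each message is assumed to be a (timestamp, text, label) tuple.
    """
    ordered = sorted(messages, key=lambda m: m[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

def random_split(messages, train_fraction=0.5, seed=0):
    """Shuffle first, so the training set can contain mail that
    arrived after some of the test mail."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

With the chronological split, every test message arrived after every training message, which is the situation a deployed filter is actually in. The random split gives no such guarantee.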
Thus, my philosophical position is that a real world app has to train
on every incoming email (and be corrected by the user when it goofs).
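A rough sketch of that train-on-everything loop follows. The ToyFilter class is a hypothetical word-count stand-in, not the Spambayes or CRM114 classifier, and the stream of (text, true_label) pairs assumes the user always supplies the right label when the filter goofs.

```python
from collections import Counter

class ToyFilter:
    """Hypothetical word-count filter, for illustration only."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def classify(self, text):
        words = text.lower().split()
        spam_score = sum(self.counts["spam"][w] for w in words)
        ham_score = sum(self.counts["ham"][w] for w in words)
        # Ties (including a completely untrained filter) fall to ham.
        return "spam" if spam_score > ham_score else "ham"

    def train(self, text, label):
        self.counts[label].update(text.lower().split())

def run_stream(filter_, stream):
    """Score each message before training on it, in arrival order.

    Returns the number of messages the filter got wrong.
    """
    mistakes = 0
    for text, true_label in stream:
        guess = filter_.classify(text)   # judged on the very next email
        if guess != true_label:
            mistakes += 1                # this is where the user corrects it
        filter_.train(text, true_label)  # then train on the corrected label
    return mistakes
```

The mistake count from such a loop measures accuracy on the next email that comes in, given everything seen so far, rather than accuracy on a held-out random sample.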
At 9:30 PM -0500 12/2/02, Bill Yerazunis wrote:
>The reason I haven't auto-trained is due to my lack of understanding
>on what the limiting amount of self-teaching one can allow that
>doesn't go off into belly gaze.
This cannot happen unless the user is derelict in correcting the
output. If he does correct it, then the input to the training system
is 100% correct. And if the training system has an aging mechanism,
any correction mistakes will eventually decay (and, if they cause
misclassifications, the user will notice and correct the filter).
Keep in mind there is always a new stream of incoming spam and ham to
work with.
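One possible form of such an aging system is sketched below. Exponential decay is an assumption here (no particular aging scheme is specified above), and the DecayingCounts class and its decay parameter are hypothetical names.

```python
from collections import defaultdict

class DecayingCounts:
    """Word/label weights that fade a little on every training step."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.counts = defaultdict(float)  # (word, label) -> weight

    def train(self, text, label):
        # Age all existing evidence first, then add the new message.
        for key in self.counts:
            self.counts[key] *= self.decay
        for word in text.lower().split():
            self.counts[(word, label)] += 1.0

    def weight(self, word, label):
        return self.counts.get((word, label), 0.0)
```

Under this scheme a single wrong training event keeps shrinking by the decay factor with each later message, while correctly labelled mail keeps being reinforced by the new stream.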
R
--
===========================================================
Robert Woodhead, CEO, AnimEigo http://www.animeigo.com/
===========================================================
http://selfpromotion.com/ The Net's only URL registration
SHARESERVICE. A power tool for power webmasters.