[Spambayes] Re: CRM114 in November breaks 99.9%. :-)

Robert Woodhead trebor at animeigo.com
Tue Dec 3 14:28:10 2002


>However, if this is the approach Bill uses, you can't use it for 
>performance estimates.  Our speech and natural language group is 
>very careful not to mix its training set with its test set.  When 
>they do, they do something like 10 fold cross validation which 
>averages (?) the results of 10 experiments that take some random 
>fraction of the data as training and the rest as testing.

Ah, but the point is that, since each individual user will have his own 
email stream to train on, all you care about is how accurate the 
system is when it looks at the very next email that comes in.  Thus, 
a system that gets very good after a few weeks of training on all the 
incoming mail, AND STAYS THAT WAY, is what you want in the real world.

Dividing up training sets can be good for analysing the statistical 
properties of particular algorithm choices, but what counts (in a 
production environment) is real world performance, and real world 
filters have to adapt as the spam (and ham) changes over time.

Tests like "pick a random sample, train on it, and then pick another 
sample (nonintersecting) from the same corpus, and test" don't 
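properly reflect the real world environment.  Spams are ordered by 
time!  A random split lets the filter train on mail that arrived 
after the mail it is tested on, which never happens in production.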
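
To make the difference concrete, here is a minimal sketch of the two ways of splitting a corpus. The message layout (a dict with a "received" timestamp) and both function names are assumptions made up for illustration; neither comes from CRM114 or Spambayes.

```python
import random

def random_split(messages, train_fraction, seed=0):
    """Shuffle, then split. The test set may contain mail that is
    older than some of the training mail."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def chronological_split(messages, train_fraction):
    """Train on the earliest mail, test only on mail that arrived later."""
    # "received" is a hypothetical timestamp field on each message.
    ordered = sorted(messages, key=lambda m: m["received"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]
```

Only the chronological split guarantees that every test message is newer than every training message, which is the situation a deployed filter is always in.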

Thus, my philosophical position is that a real world app has to train 
on every incoming email (and be corrected by the user when it goofs).

At 9:30 PM -0500 12/2/02, Bill Yerazunis wrote:
>The reason I haven't auto-trained is due to my lack of understanding
>on what the limiting amount of self-teaching one can allow that
>doesn't go off into belly gaze.

This cannot happen unless the user is derelict in correcting the 
output.  If he is not, then the input to the training system is 100% 
correct.  And if the training system has an aging mechanism, correction 
mistakes will eventually decay (and, if they cause 
misclassifications, the user will notice and correct the filter).
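
One possible form of aging, sketched under assumptions: every training step multiplies all stored token weights by a decay factor before adding the new message, so an old mistake fades as fresh mail arrives. The class name, the decay value, and the ratio formula are illustrative choices, not anything taken from an existing filter.

```python
from collections import defaultdict

class AgingCounts:
    """Hypothetical token store whose evidence decays with each new message."""
    def __init__(self, decay=0.99):
        self.decay = decay  # assumed value; smaller means faster forgetting
        self.counts = {"spam": defaultdict(float), "ham": defaultdict(float)}

    def train(self, tokens, label):
        # Age all existing evidence first ...
        for table in self.counts.values():
            for token in table:
                table[token] *= self.decay
        # ... then add the new message at full weight.
        for token in tokens:
            self.counts[label][token] += 1.0

    def spam_ratio(self, token):
        """Smoothed share of this token's weight that came from spam."""
        s = self.counts["spam"].get(token, 0.0)
        h = self.counts["ham"].get(token, 0.0)
        return (s + 0.5) / (s + h + 1.0)
```

With this scheme a single wrong "spam" training on a newsletter is outweighed after a few correctly labelled copies arrive, without anyone having to go back and undo the mistake.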

Keep in mind there is always a new stream of incoming spam and ham to 
work with.

R

-- 
===========================================================
Robert Woodhead, CEO, AnimEigo     http://www.animeigo.com/
===========================================================
http://selfpromotion.com/   The Net's only URL registration
SHARESERVICE.  A power tool for power webmasters.


