[Spambayes] Re: CRM114 in November breaks 99.9%. :-)

Bill Yerazunis wsy at merl.com
Tue Dec 3 02:30:46 2002


   X-Sender: trebor@mail.animeigo.com
   Date: Mon, 2 Dec 2002 17:35:36 -0500
   From: Robert Woodhead <trebor@animeigo.com>
   Cc: spamfilt@archub.org, spambayes@python.org
   X-Spam-Status: No, hits=-14.9 required=7.0
	   tests=IN_REP_TO,REFERENCES,SIGNATURE_SHORT_DENSE,
		 SPAM_PHRASE_01_02,SUBJECT_MONTH,SUBJECT_MONTH_2
	   version=2.41
   X-Spam-Level: 

   At 11:04 AM -0500 12/2/02, Ken Anderson wrote:
   >The "train only on errors" bothers me.  Can you say what you use for 
   >a training set and what you use for a test set?

Training a particular incarnation of CRM114 usually takes a week or
two; I read my mail (both categories) and when I find a piece of mail
misclassified, I train that one piece into the filter.
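
As a rough illustration of the train-on-errors loop described above, here
is a minimal Python sketch. ToyFilter and train_on_errors are hypothetical
names, and the token-count scoring is only a stand-in; it is not how
CRM114 scores mail.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical count-based filter used only for illustration."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}

    def learn(self, label, text):
        # Add this message's tokens to the chosen category.
        self.counts[label].update(text.lower().split())

    def classify(self, text):
        # Score by summed token counts; ties fall to "good".
        tokens = text.lower().split()
        spam = sum(self.counts["spam"][t] for t in tokens)
        good = sum(self.counts["good"][t] for t in tokens)
        return "spam" if spam > good else "good"


def train_on_errors(filt, stream):
    """Train only on messages the filter got wrong; return the error count."""
    errors = 0
    for text, truth in stream:
        if filt.classify(text) != truth:
            filt.learn(truth, text)
            errors += 1
    return errors
```

Running train_on_errors repeatedly over hand-labelled mail should show the
error count dropping as the filter absorbs its mistakes, which is the
"errors get very sparse" effect in miniature.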

After a couple of days the errors get very sparse; after two or three
weeks, I "go for data" and that's what gets reported in the monthlies.

The current spam.css files are pretty much based on the live spam
errors in the first week of October; since only four spam came through
in all of November and only two were worth training on (the Czech
Diesel Parts spam was just too funny to train out), the .css files
are pretty much unchanged.

   Yeah, have you considered training on everything?  That is to say, 
   have CRM classify an email, assume it is correct, and train on it. 
   Then, if an email comes through as false positive or negative (an 
   error), you tell CRM to untrain on that email only.

I did put in that capability as a flag called "refute".  You can say

  learn < refute > ( spamfile.css ) /[[:graph:]]/

to unlearn something as nonspam, and then you can relearn it in the
proper category, but except for testing code paths, I've never
actually used it.
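
The unlearn-then-relearn sequence can be sketched in Python, assuming a
hypothetical count-based filter. Here unlearn simply subtracts the token
counts that learn added; that is an assumption about what refuting amounts
to, not a description of CRM114 internals. All names are illustrative.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical count-based filter used only for illustration."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}

    def learn(self, label, text):
        # Add this message's tokens to the chosen category.
        self.counts[label].update(text.lower().split())

    def unlearn(self, label, text):
        # Subtract the tokens, then drop entries that fell to zero or below.
        self.counts[label].subtract(text.lower().split())
        self.counts[label] = +self.counts[label]


def correct(filt, text, wrong, right):
    """Undo training in the wrong category, then train in the right one."""
    filt.unlearn(wrong, text)
    filt.learn(right, text)
```

For example, correct(filt, message, "good", "spam") moves a message that
was wrongly trained as good over to the spam category.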

On the other hand, there's an old difficulty in AI that one of my
teachers called "the Kalman Belly Gaze".  If you let a filter
(of any type; he was teaching Kalman filters at the time, but
it applies to any trained filter) learn on its own output stream,
it quickly reinforces its own behavior to the exclusion of all
else (i.e. it goes off and gazes at its own navel, simply ignoring
the reality of the world around it).
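
A toy Python sketch of that feedback loop, under loud assumptions: the
filter below is a hypothetical token counter (not a Kalman filter and not
CRM114), and it is seeded with a single mistake before being left to train
on its own verdicts.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical count-based filter used only for illustration."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}

    def learn(self, label, text):
        # Add this message's tokens to the chosen category.
        self.counts[label].update(text.lower().split())

    def classify(self, text):
        # Score by summed token counts; ties fall to "good".
        tokens = text.lower().split()
        spam = sum(self.counts["spam"][t] for t in tokens)
        good = sum(self.counts["good"][t] for t in tokens)
        return "spam" if spam > good else "good"


def self_train(filt, messages):
    """Train on the filter's own verdicts, with no outside correction."""
    verdicts = []
    for text in messages:
        label = filt.classify(text)
        filt.learn(label, text)  # the verdict is assumed to be correct
        verdicts.append(label)
    return verdicts


filt = ToyFilter()
filt.learn("good", "diesel parts")  # one early mistake: spam learned as good
verdicts = self_train(filt, ["cheap diesel parts"] * 5)
```

In this sketch every later copy of the spam is called good, and each
verdict adds more weight to the original mistake, so nothing in the loop
can ever pull the filter back.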

I haven't auto-trained because I don't understand how much
self-teaching one can allow before the filter goes off into
belly gaze.

      -Bill Yerazunis


