[Spambayes] Training on unusual ham - revisited

Coe, Bob rcoe at CambridgeMA.GOV
Thu Feb 9 14:32:14 CET 2006


The difficulty is that there's no way to prune the database, either to
adjust the imbalance or to simply decrease the database's size. You have
to start again from scratch. The Spambayes establishment doesn't
consider this to be much of an issue, since (as Seth points out)
Spambayes does a good job of starting from scratch and building an
acceptable scoring system after seeing surprisingly little data.

This is all fine if you can limit your spam flow to a trickle during
this startup period. But if you can't, things can be very unpleasant for
a while. As part of an upgrade of my home system, I recently had
occasion to install Spambayes from scratch on two accounts (mine and my
wife's) that receive a LOT of spam. (My home domain name is a catchy one
that attracts spammers and forgers like flowers attract bees.) So while
Spambayes was still learning, hundreds of spam messages were pouring in
and landing in our "possible spam" folders. And because
all I had to train on was ham, anything that didn't go there went to our
inboxes. For two or three days, until Spambayes got its mind right, I
had to dig through this chaff and send it to the spam folders manually -
not a fun task.

Another point (I've made it before, but I guess it bears repeating) is
that the database imbalance is absolutely inherent in the current
implementation of the Spambayes algorithm, at least in the Outlook
plugin. Because users set the cutoffs to avoid false positives (you have
to if the program is going to be useful), virtually all of Spambayes's
mistakes are false negatives. Since mistakes are all you train on after
the initial startup, virtually all new entries into the database are
spam. The better job Spambayes does, the worse the imbalance becomes.
Note that the ham/spam ratio of incoming messages affects only the speed
with which this effect takes hold, not the eventual outcome. If you use
Spambayes correctly, and use it long enough, your database *will*
achieve a highly distorted ham/spam balance. If that degrades
performance, and many believe that it does, then it's a problem that has
yet to be solved.
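
To make the argument concrete, here is a toy model in Python. It is only
a sketch under stated assumptions, not Spambayes code: the function name,
the starting counts and the error rates are all made up for illustration,
the error rates are held fixed over time, and false_neg_rate lumps
together every spam that was not scored as spam and so got trained.

```python
def expected_db_counts(initial_ham, initial_spam, n_incoming, ham_fraction,
                       false_neg_rate, false_pos_rate=0.0):
    """Expected (ham, spam) counts in the training database after
    n_incoming messages, when only misclassified messages are trained.

    Every parameter is an illustrative assumption, not a Spambayes setting.
    """
    incoming_ham = n_incoming * ham_fraction
    incoming_spam = n_incoming - incoming_ham
    # A false positive (ham scored as spam) adds one ham to the database.
    ham = initial_ham + incoming_ham * false_pos_rate
    # A false negative (spam not scored as spam) adds one spam.
    spam = initial_spam + incoming_spam * false_neg_rate
    return ham, spam


# Same balanced starting database, three different mixes of incoming mail.
for ham_fraction in (0.9, 0.5, 0.1):
    ham, spam = expected_db_counts(200, 200, 100_000, ham_fraction,
                                   false_neg_rate=0.02)
    print(f"ham share of incoming mail {ham_fraction:.0%}: "
          f"database ham={ham:.0f}, spam={spam:.0f}, "
          f"spam/ham={spam / ham:.1f}")
```

With the false positive rate at zero, the ham count never moves and only
the spam count grows; a larger ham share in the incoming mail just slows
the drift. A small nonzero false_pos_rate changes the numbers but, as
long as it is much smaller than false_neg_rate, not the direction.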

Bob

MIS Department, City of Cambridge
831 Massachusetts Ave, Cambridge MA 02139
617-349-4217  *  fax 617-349-6165


> -----Original Message-----
> From: spambayes-bounces at python.org 
> [mailto:spambayes-bounces at python.org] On Behalf Of Seth Goodman
> Sent: Wednesday, February 08, 2006 6:44 PM
> To: spambayes at python.org
> Subject: Re: [Spambayes] Training on unusual ham - revisited
> 
> 
> On Thursday, February 02, 2006 10:35 PM -0600, Bob Posert wrote:
> 
> > Back in  
> > http://mail.python.org/pipermail/spambayes/2006-January/018702.html
> >  , Tim Peters and I had a dialog about training on unusual ham - 
> > monthly messages from http://www.boldtype.com.  I just got another
> > one and it scored 50% on the spam scale.  The clues follow - I'd
> > really appreciate any help. Thanks, Bob
> >
> >  Combined Score: 50% (0.5)
> >  Internal ham score (*H*): 1
> >  Internal spam score (*S*): 1
> >
> >  # ham trained on: 1229
> >  # spam trained on: 20331
> 
> Something else worth mentioning is the large total number of 
> messages in the training set.  While there isn't much 
> evidence that I'm aware of that says this harms accuracy, 
> most people are able to get very good results with a few 
> hundred to a few thousand trained messages.  Some have 
> reported good results with on the order of 50 of each type.  
> If nothing else, this makes the databases very large.
> 
> --
> Seth Goodman