[Spambayes] Upgrading from 1.0a2

Gary Benson gary at inauspicious.org
Fri Dec 5 09:08:09 EST 2003

Richie Hindle wrote:
> Gary Benson wrote:
> > I note that many of the changelog entries are for tokeniser
> > improvements.  Would I have to do a retrain to get these
> > improvements into my database?
> In one way yes, because your current database is the result of
> running the emails through the 1.0a2 tokeniser.  So say you had an
> email containing "via<hide>gra" (which the tokeniser now understands,
> but didn't use to) then you'll have a "via" and a "gra" token
> instead of one "viagra" token.  But in another way no, because new
> emails will go through the new tokeniser.  Since you probably have a
> decent spam score for "viagra" already, any new "via<hide>gra" email
> will get a hit for "viagra".
> If you're getting good results, I wouldn't worry about retraining.

For the past few months I've been getting a hit rate of about 97% on
my home account, with only a couple of false positives in that time
(all automated stuff along the lines of 'thanks for signing up to our
website').  My work account is not so good at 94%, but I haven't been
using spambayes on it for as long, so I imagine that the database is
less well trained.

I worked out that I could drop the spam_cutoff from the default of
0.90 to about 0.65 on my home account, which should bump the hit rate
up for next month, but I see Paul Graham quoting hit rates of 99.7% or
whatever and I lust ;)  And I have copies of every email it was trained
on, so a retrain would simply be a matter of spending half an hour or
so writing a script.
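
A minimal sketch of how a cutoff like that could be worked out from
already-scored mail.  The function names and the safety margin are
hypothetical and not part of spambayes, and it assumes a message is
treated as spam when its score is at or above the cutoff.

```python
def lowest_safe_cutoff(ham_scores, margin=0.05):
    """Suggest a spam cutoff just above the highest-scoring ham.

    `margin` is an arbitrary safety gap chosen for illustration.
    """
    return min(1.0, max(ham_scores) + margin)


def hit_rate(spam_scores, cutoff):
    """Return the fraction of spam scoring at or above the cutoff."""
    if not spam_scores:
        raise ValueError("need at least one spam score")
    caught = sum(1 for score in spam_scores if score >= cutoff)
    return caught / len(spam_scores)
```

With made-up scores ham = [0.01, 0.10, 0.60] and spam = [0.62, 0.70,
0.95, 0.99], the suggested cutoff is about 0.65, and the hit rate goes
from 0.5 at a cutoff of 0.90 to 0.75 at the lower one.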

On that note, is there anything I can do to the training set to
improve the generated database?  I've seen advice such as ensuring the
number of spams and hams is roughly equal, for example; is there any
truth in that?  Another thing is that I have some twelve months' worth
of mail in my home training set: should I use it all, or cull some of
the older stuff?
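
To make the two ideas concrete, here is a rough sketch of what
balancing and culling a training set might look like.  Whether either
actually helps is the open question; this only shows the selection
step.  The function name, the tuple layout and the age limit are all
hypothetical, and nothing here uses the spambayes API.

```python
import random
from datetime import datetime, timedelta


def select_training_set(messages, now, max_age_days=None, seed=0):
    """Pick an equal number of hams and spams, optionally dropping old mail.

    `messages` is a list of (received, is_spam, ident) tuples, where
    `received` is a datetime and `ident` is any handle on the message
    (a filename, say).
    """
    if max_age_days is not None:
        oldest = now - timedelta(days=max_age_days)
        messages = [m for m in messages if m[0] >= oldest]

    ham = [m for m in messages if not m[1]]
    spam = [m for m in messages if m[1]]

    # Keep as many of each class as the smaller class can supply.
    n = min(len(ham), len(spam))
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    return rng.sample(ham, n) + rng.sample(spam, n)
```

The selected messages would then be fed to whatever retraining script
is used, in place of the full archive.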


[ gary at inauspicious.org ][ GnuPG 85A8F78B ][ http://inauspicious.org/ ]

