[Spambayes] Upgrading from 1.0a2
Gary Benson
gary at inauspicious.org
Fri Dec 5 09:08:09 EST 2003
Richie Hindle wrote:
> Gary Benson wrote:
> > I note that many of the changelog entries are for tokeniser
> > improvements. Would I have to do a retrain to get these
> > improvements into my database?
>
> In one way yes, because your current database is the result of
> running the emails through the 1.0a2 tokeniser. So say you had an
> email containing "via<hide>gra" (which the tokeniser now understands,
> but didn't used to) then you'll have a "via" and a "gra" token
> instead of one "viagra" token. But in another way no, because new
> emails will go through the new tokeniser. Since you probably have a
> decent spam score for "viagra" already, any new "via<hide>gra" email
> will get a hit for "viagra".
>
> If you're getting good results, I wouldn't worry about retraining.
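
To make the "via<hide>gra" point concrete, here is a toy sketch in Python. It is not the SpamBayes tokeniser: both functions and their names are made up for illustration, on the assumption that the older behaviour treated an embedded tag as a word break while the newer behaviour drops the tag before splitting.

```python
import re

TAG = r"<[^>]*>"

def old_style_tokens(text):
    # Hypothetical stand-in for the older behaviour:
    # an embedded tag splits the word, just like whitespace does.
    return [t.lower() for t in re.split(TAG + r"|\s+", text) if t]

def new_style_tokens(text):
    # Hypothetical stand-in for the newer behaviour:
    # tags are removed first, so the word survives in one piece.
    stripped = re.sub(TAG, "", text)
    return [t.lower() for t in stripped.split()]

sample = "Buy via<hide>gra now"
print(old_style_tokens(sample))  # ['buy', 'via', 'gra', 'now']
print(new_style_tokens(sample))  # ['buy', 'viagra', 'now']
```

A database trained through the first function has counts for "via" and "gra"; new mail passed through the second function looks up "viagra" instead, which only scores well if plain "viagra" was already seen in training.
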
For the past few months I've been getting a hit rate of about 97% on
my home account, with only a couple of false positives in that time
(all automated stuff along the lines of 'thanks for signing up to our
website'). My work account is not so good, 94%, but I've not been
using spambayes on it for so long so I imagine that the database is
less well trained.
I worked out that I could drop the spam_cutoff from the default of
0.90 to about 0.65 on my home account, which should bump it up for
next month, but I see Paul Graham quoting hit rates of 99.7% or
whatever and I lust ;) And I have copies of every email it was trained
on, so a retrain would simply be a matter of spending half an hour or
so writing a script.
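
The effect of lowering spam_cutoff can be sketched as below. The scores are invented purely for illustration (they are not from any real mailbox), the function name is hypothetical, and treating "score >= cutoff" as spam is an assumption about how the boundary is handled.

```python
def rates(scored, spam_cutoff):
    """scored: list of (score, is_spam) pairs.

    Returns (hit rate on spam, number of ham false positives),
    assuming a message counts as spam when score >= spam_cutoff.
    """
    spam_scores = [s for s, is_spam in scored if is_spam]
    ham_scores = [s for s, is_spam in scored if not is_spam]
    caught = sum(1 for s in spam_scores if s >= spam_cutoff)
    false_pos = sum(1 for s in ham_scores if s >= spam_cutoff)
    return caught / len(spam_scores), false_pos

# Made-up scores: five spams, four hams.
sample = [(0.99, True), (0.95, True), (0.80, True), (0.70, True), (0.40, True),
          (0.01, False), (0.05, False), (0.30, False), (0.60, False)]

for cutoff in (0.90, 0.65):
    hit_rate, false_pos = rates(sample, cutoff)
    print(cutoff, hit_rate, false_pos)
```

The cutoff can only safely drop as far as the highest-scoring ham allows; in this made-up sample that is 0.60, which is why 0.65 gains hits without adding false positives.
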
On that note, is there anything I can do to the training set to
improve the generated database? I've seen things like ensuring the
number of spams and hams is roughly equal, for example; is there any
truth in that? Another thing is that I have some twelve months' worth
of mail in my home training set: should I use it all, or cull some of
the older stuff?
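
For anyone who wants to try both ideas, a minimal sketch of the selection step might look like this. Whether balancing or culling actually improves the database is exactly the open question above; the function name, the tuple layout and the 180-day window are all assumptions made for the example.

```python
import random
from datetime import date, timedelta

def select_training_set(messages, today, max_age_days, seed=0):
    """messages: list of (received_date, is_spam, message_id) tuples.

    Drops anything older than max_age_days, then randomly downsamples
    the larger class so spam and ham counts are equal.
    """
    oldest_allowed = today - timedelta(days=max_age_days)
    recent = [m for m in messages if m[0] >= oldest_allowed]
    spam = [m for m in recent if m[1]]
    ham = [m for m in recent if not m[1]]
    n = min(len(spam), len(ham))
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    return rng.sample(spam, n) + rng.sample(ham, n)

# Made-up corpus: one spam every 10 days, one ham every 30 days, over ~12 months.
today = date(2003, 12, 5)
messages = (
    [(today - timedelta(days=d), True, f"spam-{d}") for d in range(0, 360, 10)]
    + [(today - timedelta(days=d), False, f"ham-{d}") for d in range(0, 360, 30)]
)

chosen = select_training_set(messages, today, max_age_days=180)
print(len(chosen), sum(1 for m in chosen if m[1]))
```

Training once on the full set and once on the selected set, then comparing hit rates on mail held back from both, would show whether either idea helps for a given mailbox.
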
Cheers,
Gary
[ gary at inauspicious.org ][ GnuPG 85A8F78B ][ http://inauspicious.org/ ]