[Spambayes] ageing out database entries

Kenny Pitt kennypitt at hotmail.com
Mon Nov 17 10:15:11 EST 2003


Seth Goodman wrote:
> Kenny,
> 
> ...  Do you have any
> comments on the stuff at the bottom of my previous post (reproduced
> below)?  This is what motivated the discussion of K9 training in the
> first place and I would value any insight or suggestions that anyone
> could offer. 
> 
>> Well, what bothers me, so far, is that despite training on 620 ham
>> and 1403 spam, SpamBayes still manages to miss (score as ham) 5-10
>> messages per day out of around 150 scored messages.  Most of these
>> missed spams have an initial score very close to zero, so simply
>> lowering the ham threshold would not fix it.  After training as
>> spam, their spam score often increases respectably, but sometimes,
>> the score stays below 5%.  This indicates that the same message
>> would be missed next time, as well.  I don't know if I just need to
>> get a bigger or more balanced training set, if there are some types
>> of tokens (such as embedded URL's in HTML spam) that are not
>> currently parsed or if this is just as good as it gets.  Anyway,
>> that's what I would like to see improved and it is the motivation
>> for the above discussion.
>> 
>> Any thoughts from those who've been there already?

There is currently a discussion developing under the subject "SpamBayes
now filters less than 50% of my spam" regarding the significance of
ham/spam training imbalance and the difficulties that can appear as you
train on larger numbers of messages.  Some of it may apply to your
situation.

In your case, my guess is that you should look at how similar the missed
spam messages are to spams that you have already trained on.  You have
1403 spam messages trained, so training 5 new spams isn't much relative
to that total if the tokens in the new spams haven't been seen much
before.  If those spams contain tokens that also appear in your ham,
then it could take a significant amount of training to counteract the
effect of the ham tokens.  You have fewer trained hams than spams, so
each occurrence of a token in the ham training contributes somewhat
more strongly to the overall probability than an occurrence of the same
token in the spam training.
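
To put rough numbers on that last point, here is a minimal sketch of a
Robinson-style per-token estimate, where counts are divided by the
number of trained messages of each kind before being compared.  The
function names are made up for illustration, and the s=0.45 and x=0.5
values for the unknown-word adjustment are assumptions, so treat this as
an approximation of the idea rather than the classifier's actual code.

```python
def token_spam_prob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Estimate a token's spam probability (illustrative sketch only).

    spamcount/hamcount: trained spams/hams containing the token.
    nspam/nham: total trained spams/hams.
    s, x: assumed strength and prior for rarely seen tokens.
    """
    # Normalize by corpus size so a smaller corpus weighs more per message.
    spamratio = spamcount / nspam if nspam else 0.0
    hamratio = hamcount / nham if nham else 0.0
    if spamratio + hamratio == 0.0:
        return x  # never-seen token: fall back to the prior
    prob = spamratio / (spamratio + hamratio)
    n = spamcount + hamcount
    # Pull the raw estimate toward x when there is little evidence.
    return (s * x + n * prob) / (s + n)


def spam_sightings_to_offset(hamcount, nspam, nham):
    """Smallest spam count that pushes the token's estimate above 0.5."""
    if nspam <= 0:
        raise ValueError("need at least one trained spam")
    spamcount = 0
    while token_spam_prob(spamcount, hamcount, nspam, nham) <= 0.5:
        spamcount += 1
    return spamcount


if __name__ == "__main__":
    # Training set sizes from the quoted message: 620 ham, 1403 spam.
    print(token_spam_prob(1, 1, nspam=1403, nham=620))
    print(spam_sightings_to_offset(1, nspam=1403, nham=620))
```

Under these assumptions, a token seen in one ham and one spam still
leans toward ham, and it takes three spam sightings to outweigh a single
ham sighting with a 620/1403 split.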

-- 
Kenny Pitt



