[Spambayes] Spam Clues: Re: index.cgi redirection

Kenny Pitt kennypitt at hotmail.com
Wed Nov 12 11:13:25 EST 2003


Grützmacher, Lukas wrote:
> Even though I have trained SpamBayes with many (over 100) mails from
> this list as good, most of them are identified as "possible spam". I
> currently cannot understand why.
> 
> 1) Can you explain which parts of the Spam Clues go into the
> calculation of the 0.386739 score (below) for the mail? (I could not
> find any description in the documentation.)
> 2) Is it a problem with SpamBayes, with the list, or with my
> configuration?
> 
> 
> Spam Score: 39% (0.386739)
> 
> 
> word                                spamprob         #ham  #spam
> 'proto:http'                        0.614138          989    127
> 'can'                               0.61691           415     54
> 'are'                               0.631922          418     58
> 'you'                               0.63923           601     86
> 'header:Date:1'                     0.647151          912    135
> 'header:From:1'                     0.647151          912    135
> 'header:Return-Path:1'              0.653197          888    135
> 'header:Message-ID:1'               0.656727          738    114
> 'to:no real name:2**0'              0.679622          677    116
> 'header:Received:3'                 0.812182          236     83

How many total hams and spams have you trained on?  The clues I left
above particularly stood out to me because you are getting relatively
high spam probabilities even though the ham counts are much higher than
the spam counts.  This usually indicates that you have unbalanced
training data where you have a lot more messages of one type than the
other.  In this case, I would guess several thousand hams vs. only a
couple hundred spams.

Unbalanced training data can cause accuracy problems, and in particular
can make it difficult for additional training to overcome the effects of
words that appear in both ham and spam.  All probabilities are based on
ratios, not absolute numbers.  For a given word, the raw ham ratio is
the number of times the word has been seen in a ham message divided by
the total number of ham messages that have been trained.  The raw spam
ratio is computed the same way, and then the two ratios are combined to
form the spamprob for that word.  If you have trained on 2000 ham
messages, then a word that has appeared 100 times would have a raw ham
ratio of 0.05.  If you have trained on only 200 spam messages, then it
takes only 10 occurrences of the word in spam to get the same 0.05
ratio.
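
To make the arithmetic concrete, here is a rough Python sketch of the
ratio calculation described above.  The function names are made up for
illustration, and the combining step (spam ratio divided by the sum of
both ratios) is an assumed simplification; the real SpamBayes code may
apply further adjustments that are not modeled here.  Under that same
assumption, the formula can be turned around to estimate the
ham-to-spam training balance implied by a clue line.

```python
def raw_ratio(word_count, total_messages):
    """Fraction of trained messages of one type in which the word was seen."""
    if total_messages == 0:
        return 0.0
    return word_count / total_messages


def combine(ham_ratio, spam_ratio):
    """Assumed combining rule: spam ratio as a share of both ratios."""
    total = ham_ratio + spam_ratio
    if total == 0:
        return 0.5  # word never seen: treat as neutral (assumption)
    return spam_ratio / total


def implied_ham_to_spam(spamprob, ham_count, spam_count):
    """Invert the assumed rule: estimate (total hams) / (total spams)."""
    return (spamprob / (1.0 - spamprob)) * (ham_count / spam_count)


# The example from the message: 100 of 2000 hams vs. 10 of 200 spams.
ham_ratio = raw_ratio(100, 2000)
spam_ratio = raw_ratio(10, 200)
print(ham_ratio, spam_ratio, combine(ham_ratio, spam_ratio))

# Clue lines quoted above: (word, spamprob, #ham, #spam).
clues = [
    ("proto:http", 0.614138, 989, 127),
    ("you", 0.63923, 601, 86),
    ("header:Received:3", 0.812182, 236, 83),
]
for word, prob, n_ham, n_spam in clues:
    print(word, round(implied_ham_to_spam(prob, n_ham, n_spam), 1))
```

With equal raw ratios the word comes out neutral, even though it was
seen ten times as often in ham.  For the quoted clues, the inverted
formula points to roughly a dozen trained hams for every trained spam,
if the simplified combining rule is close to what SpamBayes does.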

-- 
Kenny Pitt



