[Spambayes] Spam Clues: Re: index.cgi redirection

"Grützmacher, Lukas" gruetzmacher at ais-dresden.de
Wed Nov 12 11:27:34 EST 2003


The SpamBayes Manager reports the training status as about 1600 ham and 135 spam mails, although I think I had more spam mails.

Do I understand you correctly: because I have more ham than spam mails, my training becomes unbalanced?

Lukas

> -----Original Message-----
> From: Kenny Pitt [mailto:kennypitt at hotmail.com]
> Sent: Wednesday, November 12, 2003 5:13 PM
> To: Grützmacher, Lukas; spambayes at python.org
> Subject: RE: [Spambayes] Spam Clues: Re: index.cgi redirection
> 
> 
> Grützmacher, Lukas wrote:
> > Even though I have trained SpamBayes with many (over 100) mails from this
> > list as good, most of them are identified as "possible spam". I'm
> > currently not able to understand why.
> > 
> > 1) Can you explain to me which parts of the Spam Clues are used to
> > calculate the 0.386739 (below) for the mail? (I could not find any
> > description in the documentation.)
> > 2) Is it a problem with SpamBayes, with the list, or with my
> > configuration?
> > 
> > Spam Score: 39% (0.386739)
> > 
> > 
> > word                                spamprob         #ham  #spam
> > 'proto:http'                        0.614138          989    127
> > 'can'                               0.61691           415     54
> > 'are'                               0.631922          418     58
> > 'you'                               0.63923           601     86
> > 'header:Date:1'                     0.647151          912    135
> > 'header:From:1'                     0.647151          912    135
> > 'header:Return-Path:1'              0.653197          888    135
> > 'header:Message-ID:1'               0.656727          738    114
> > 'to:no real name:2**0'              0.679622          677    116
> > 'header:Received:3'                 0.812182          236     83
> 
> How many total hams and spams have you trained on?  The clues I left
> above particularly stood out to me because you are getting relatively
> high spam probabilities even though the ham counts are much higher than
> the spam counts.  This usually indicates that you have unbalanced
> training data where you have a lot more messages of one type than the
> other.  In this case, I would guess several thousand hams vs. only a
> couple hundred spams.
> 
> Unbalanced training data can cause accuracy problems, and in particular
> can make it difficult for additional training to overcome the effects of
> words that appear in both ham and spam.  All probabilities are based on
> ratios, not absolute numbers.  For a given word, the raw ham ratio is
> the number of times the word has been seen in a ham message divided by
> the total number of ham messages that have been trained.  The raw spam
> ratio is computed the same way, and then the two ratios are combined to
> form the spamprob for that word.  If you have trained on 2000 ham
> messages, then a word that has appeared 100 times would have a raw ham
> score of 0.05.  If you have only trained on 200 spam messages then it
> only takes 10 occurrences of the word in spam to get the same 0.05 score.
> 
> -- 
> Kenny Pitt
> 
> 
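
The ratio arithmetic Kenny describes can be sketched in a few lines of Python. This is only an illustration, not the SpamBayes source: the function names raw_ratio and raw_spamprob are made up for this example, and the way the two ratios are combined (spam ratio divided by the sum of both ratios) is an assumption, since the message only says they "are combined". Any further adjustment the real classifier may apply is left out.

```python
def raw_ratio(count, total):
    """Fraction of the trained messages of one class that contained the word."""
    return count / total if total else 0.0


def raw_spamprob(ham_count, spam_count, nham, nspam):
    """Combine the raw ham and spam ratios into one per-word probability.

    Assumption: the ratios are combined as spam_ratio / (ham_ratio + spam_ratio).
    A word that was never seen gets the neutral value 0.5.
    """
    ham_ratio = raw_ratio(ham_count, nham)
    spam_ratio = raw_ratio(spam_count, nspam)
    if ham_ratio + spam_ratio == 0:
        return 0.5
    return spam_ratio / (ham_ratio + spam_ratio)


# Kenny's example: 100 of 2000 hams and 10 of 200 spams give the same raw ratio,
# so the word is neutral even though it was seen ten times more often in ham.
print(raw_ratio(100, 2000), raw_ratio(10, 200))
print(raw_spamprob(100, 10, 2000, 200))

# 'header:Date:1' from the clue table, using the reported totals of
# about 1600 ham and 135 spam.
print(raw_spamprob(912, 135, 1600, 135))
```

With the approximate totals from the top of this message, the last call gives roughly 0.64, which is close to the 0.647151 shown in the clue table. The gap is expected, because the ham total is only approximate and this sketch skips whatever adjustment SpamBayes makes on top of the raw ratios. It still shows the effect: a header present in every one of the 135 spams scores as spammy although it appeared in 912 hams.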


