[Spambayes] Training on unusual ham - revisited

Sun Feb 12 05:15:24 CET 2006

> Back in http://mail.python.org/pipermail/spambayes/2006-January/018702.html ,
> Tim Peters and I had a dialog about training on unusual ham - monthly
> messages from http://www.boldtype.com.  I just got another one and it
> scored 50% on the spam scale.  The clues follow - [...]
>  Combined Score: 50% (0.5)
>  Internal ham score (*H*): 1
>  Internal spam score (*S*): 1

In other words, the message looked a lot like ham *and* a lot like spam,
so the two internal scores cancel each other out and you end up at 0.5.
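
To make that concrete, here is a minimal sketch of chi-squared combining
as I understand it: H and S are computed separately from the same clues,
and the final score is (S - H + 1) / 2.  The function names and the
example clue lists are made up for illustration; this is not a copy of
the actual classifier code.

```python
import math

def chi2q(x2, v):
    """Survival function of a chi-squared distribution with an even
    number of degrees of freedom v, evaluated at x2."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Guard against rounding pushing the sum slightly above 1.
    return min(total, 1.0)

def combine(probs):
    """Return (H, S, score) for a list of per-clue spam probabilities."""
    n = len(probs)
    # S is near 1 when many clues have probabilities close to 1.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # H is near 1 when many clues have probabilities close to 0.
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return h, s, (s - h + 1.0) / 2.0

# Hypothetical message with 20 strong ham clues and 20 strong spam clues.
mixed = [0.01] * 20 + [0.99] * 20
h, s, score = combine(mixed)
print("H=%.4f S=%.4f score=%.4f" % (h, s, score))
```

With strong evidence on both sides, H and S both end up very close to 1
and the score sits in the middle, which is the "unsure" result you saw.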

>  # ham trained on: 1229
>  #  spam trained on: 20331

As others have said, there's quite an imbalance here (roughly 16 spam
for every ham), as well as quite a large database.  My personal opinion
(which is backed up by at least some of the research) is that larger
databases are worse.

> '1950'                              0.97619             0      9
> [...]
> 'broke'                             0.997512            0     90
> 'accordance'                        0.998921            0    208
> 'discreet'                          0.999019            0    229
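
For reference, the probabilities in that listing can be reproduced from
the counts alone.  The sketch below assumes the usual adjustment with an
unknown-word strength of 0.45 and an unknown-word probability of 0.5
(I believe these are the defaults, but treat them as assumptions); the
function name is made up.  With those values it gives about 0.997512 for
'broke' (0 ham, 90 spam), matching the clue listing.

```python
def clue_probability(ham_count, spam_count, n_ham, n_spam, s=0.45, x=0.5):
    """Estimate a token's spam probability from training counts.

    s and x are the assumed unknown-word strength and probability.
    """
    n = ham_count + spam_count
    if n == 0:
        # Token never seen in training: fall back to the prior.
        return x
    # Use ratios rather than raw counts, so the ham/spam imbalance in
    # the training data is at least partly compensated for.
    ham_ratio = ham_count / n_ham
    spam_ratio = spam_count / n_spam
    prob = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the estimate towards x; the pull weakens as n grows.
    return (s * x + n * prob) / (s + n)

# Counts taken from the quoted clue listing and training totals.
for token, spam_count in [("1950", 9), ("broke", 90),
                          ("accordance", 208), ("discreet", 229)]:
    print(token, round(clue_probability(0, spam_count, 1229, 20331), 6))
```

Note that a token seen in only 9 spam and no ham already gets a
probability above 0.97, so tokens that merely happen to be absent from
a small ham collection look very spammy.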

None of the spam clues look very spammy to me (although I don't know
what you consider spam, of course).  Do you have any idea what the 9
to 229 messages that had these clues were?  Were they all some sort of
'word salad' spam?  If so, then perhaps not training on those would
help (and I believe the large database and the imbalance both
contribute to the problem).

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.