[Spambayes] Training on unusual ham - revisited
tameyer at ihug.co.nz
Sun Feb 12 05:15:24 CET 2006
> Back in
> http://mail.python.org/pipermail/spambayes/2006-January/018702.html ,
> Tim Peters and I had a dialog about training on unusual ham - monthly
> messages from http://www.boldtype.com. I just got another one and it
> scored 50% on the spam scale. The clues follow - [...]
> Combined Score: 50% (0.5)
> Internal ham score (*H*): 1
> Internal spam score (*S*): 1
IOW, the message looked a lot like ham, *and* a lot like spam.
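
To make that concrete, here is a minimal sketch of how the two internal scores fold into the combined score. It assumes the chi-squared combining arithmetic, where the result is (S - H + 1) / 2. The function name combined_score is mine, for illustration only, and is not part of the SpamBayes API.

```python
def combined_score(ham_score: float, spam_score: float) -> float:
    """Fold the internal *H* and *S* indicators into one 0..1 score.

    Assumption: the chi-squared combiner reports (S - H + 1) / 2.
    """
    return (spam_score - ham_score + 1.0) / 2.0


# The message above: strong ham evidence and strong spam evidence cancel out.
print(combined_score(ham_score=1.0, spam_score=1.0))  # 0.5
```

When both indicators are at 1 the evidence cancels and the message lands squarely in the middle (unsure), which is what the 50% above shows.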
> # ham trained on: 1229
> # spam trained on: 20331
As others have said, there's quite an imbalance here (roughly 1 ham to
every 16 or 17 spam), as well as quite a large database. My personal
opinion (which is backed up by at least some of the research) is that
larger databases give worse results.
> '1950' 0.97619 0 9
> 'broke' 0.997512 0 90
> 'accordance' 0.998921 0 208
> 'discreet' 0.999019 0 229
None of the spam clues look very spammy to me (although I don't know
what you consider spam, of course). Do you have any idea what the 9
to 90 messages that had these clues were? Were they all some sort of
'word salad' spam? If so, then perhaps avoiding training on these
would help (and I believe the large database and the imbalance
contribute to the problem).
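
For reference, the clue probabilities quoted above can be reproduced from the counts alone. This is a rough sketch, not the actual classifier code: it assumes the usual ratio-based estimate with the Bayesian adjustment toward an unknown-word prior, using a strength of 0.45 and a prior of 0.5 (I believe these are the defaults, but check your own configuration). The name clue_spamprob is hypothetical.

```python
def clue_spamprob(ham_count, spam_count, nham, nspam,
                  strength=0.45, prior=0.5):
    """Estimate a token's spam probability from training counts.

    Assumptions: strength and prior mirror the unknown-word settings
    (0.45 and 0.5); nham and nspam are the trained message totals.
    """
    n = ham_count + spam_count
    if n == 0:
        # Never-seen token: fall back to the prior.
        return prior
    # Normalise by corpus size so the ham/spam imbalance is accounted for.
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    raw = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the raw estimate toward the prior; the pull weakens as n grows.
    return (strength * prior + n * raw) / (strength + n)


# Counts taken from the clue list above (1229 ham, 20331 spam trained).
for word, ham, spam in [("1950", 0, 9), ("broke", 0, 90),
                        ("accordance", 0, 208), ("discreet", 0, 229)]:
    print(word, round(clue_spamprob(ham, spam, 1229, 20331), 6))
```

With zero ham occurrences the raw estimate is 1.0, so the final value depends only on how many spam messages contained the word. A token seen in a couple of hundred spam and no ham ends up very close to 1 even if it is an ordinary English word.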
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.