[spambayes-dev] RE: [Spambayes] question regarding training
tim.peters at gmail.com
Sat Aug 14 05:56:42 CEST 2004
[T. Alexander Popiel]
> I have in the past suggested that the ideal imbalance is related to
> the number of distinct 'topics' in each category. For instance,
> there's only about 4 topics in my ham: spambayes discussions, pennmush
> discussions, administrative mails, and idle chatter from my friends.
> On the other hand, there's many more topics in my spam: delivery
> errors caused by virus joe-jobs, sexual enhancement, mortgage loans,
> weight reduction, nigerian-style scams, must-have lawn ornaments,
> stock pick of the scammer, chain letters, this-is-not-a-marketing-pyramid,
> etc. ...
Just FYI, I've heard several anecdotal reports that the N-way
classical Bayesian classifier POPFile (which is a good one) does a
better job at catching spam if you indeed create several distinct spam
categories (porn spam, mortgage spam, etc), instead of having one
catch-all spam category.
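To make the idea concrete, here is a minimal sketch of an N-way naive Bayes classifier where spam is split across several buckets and a message counts as spam if any spam bucket wins. This is not POPFile's code (POPFile is written in Perl and its tokenizer and scoring differ); the class name, the bucket names, the whitespace tokenizer, and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NWayNaiveBayes:
    """Toy multinomial naive Bayes over an arbitrary set of buckets (hypothetical)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.msg_counts = Counter()              # bucket -> messages trained
        self.vocab = set()                       # every word seen in training

    def train(self, bucket, text):
        words = text.lower().split()             # assumption: whitespace tokens
        self.word_counts[bucket].update(words)
        self.msg_counts[bucket] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the bucket with the highest log posterior score."""
        words = text.lower().split()
        total_msgs = sum(self.msg_counts.values())
        best_bucket, best_score = None, -math.inf
        for bucket in self.msg_counts:
            counts = self.word_counts[bucket]
            # Add-one (Laplace) smoothing so unseen words don't zero the score.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.msg_counts[bucket] / total_msgs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


# Hypothetical bucket names: several spam "topics" instead of one catch-all.
SPAM_BUCKETS = {"spam-mortgage", "spam-pills"}


def is_spam(clf, text):
    """A message is spam if any of the spam buckets wins."""
    return clf.classify(text) in SPAM_BUCKETS


clf = NWayNaiveBayes()
clf.train("ham", "spambayes training question about the tokenizer")
clf.train("ham", "pennmush patch discussion for the next release")
clf.train("spam-mortgage", "lowest mortgage rates refinance your loan today")
clf.train("spam-pills", "weight loss pills lose weight fast today")

print(clf.classify("refinance your mortgage loan"))
print(is_spam(clf, "question about spambayes training"))
```

Each spam bucket keeps its own word statistics, so mortgage vocabulary is not diluted by pill vocabulary; the final spam/ham verdict is just the union of the spam buckets.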