[Spambayes] Is Equal Ham & Spam really the best?

Brendon Whateley spambayes at whateley.com
Sun Jul 29 00:08:06 CEST 2007


Mark Hammond wrote:
>> That is high relative to the conventional wisdom, but I'm questioning
>> the correctness of that wisdom.
>>     
>
> Check out this thread, which should give you a reasonable idea:
>
> http://mail.python.org/pipermail/spambayes-dev/2003-November/001578.html
>   
That thread was interesting, but still runs under the assumption that
balanced training is the ideal.
>   
>> Perhaps it's time to re-evaluate that statement?
>>     
>
> Google also shows anecdotal reports of poor results after an imbalance as
> low as 2:1, so I don't think it would be responsible to re-evaluate that
>   
"responsible"?  I'm not sure what you mean.
> statement until clear evidence was presented to the contrary.
>   
I assumed that running a test to evaluate the effects of imbalance would
be the way to generate or refute such evidence?  When I get back from
Hawaii, I think I'll dust off the old test corpus and try some tests. 
If anybody else has some test results, I'd be very interested in seeing
them.

My current thought is that getting a (very) large amount of spam with
very few clues in each email is what results in the imbalance.  I've
just checked some of today's spam and some had as few as 31 clues.  With
so few clues, it is relatively easy for a spam message to end up with an
unsure or even ham classification while most ham is still being
correctly classified.  The alternative to an imbalanced training set is
to find an easy way to train on extra ham, but only the ham that still
has some classification value to add.
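
As a rough illustration of "only the ham that still has some
classification value", here is a minimal sketch that keeps a ham message
for training only if the current classifier does not already score it as
confident ham.  Everything in it is hypothetical: select_ham_worth_training
and toy_score are made-up names, the scorer is a toy stand-in rather than
the SpamBayes classifier, and the 0.20 cutoff is just an assumed threshold.

```python
from typing import Callable, Iterable, List


def select_ham_worth_training(
    ham_messages: Iterable[str],
    score: Callable[[str], float],
    ham_cutoff: float = 0.20,  # assumed threshold, not a tuned value
) -> List[str]:
    """Keep only ham whose spam score is above ham_cutoff.

    `score` is any callable returning a spam probability in [0, 1].
    Ham already scoring at or below the cutoff is classified correctly
    and is assumed to add little by being trained on again.
    """
    return [msg for msg in ham_messages if score(msg) > ham_cutoff]


def toy_score(msg: str) -> float:
    """Toy stand-in scorer: fraction of words found in a tiny 'spammy' set."""
    spammy = {"viagra", "winner", "free"}
    words = msg.lower().split()
    if not words:
        return 0.5  # no evidence either way
    return sum(w in spammy for w in words) / len(words)


ham = [
    "lunch at noon tomorrow",
    "free winner tickets for the team",
    "meeting notes attached",
]
selected = select_ham_worth_training(ham, toy_score)
print(selected)
```

With a real classifier, the scoring callable would wrap whatever returns
the message's spam probability; the selection logic itself would not
change.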

Brendon.


