[spambayes-dev] imbalance within ham or spam training sets?

Mon Nov 3 19:34:58 EST 2003

In message:  <LNBBLJKPBEHFEDALKOLCGEKIGOAB.tim.one at comcast.net>
             "Tim Peters" <tim.one at comcast.net> writes:
>>
>> Perhaps it's time to test a variation where the prob is based on
>> hamcount and spamcount instead of hamratio and spamratio.  Hrm.
>> *tap, tap, tap*  I'll be back in a few hours...
>
>Well, they're all the same if the # of training ham == the # of training
>spam.  Computing spamprobs based on ratios is a first attempt at surviving
>in the face of unbalanced training data.

Hrm, yes.  I'm obviously not thinking all that well today.
This leads me to consider scaling the elements of the
probability nonlinearly by the ham/spam imbalance before
combining them into the prob, instead of scaling the
perceived number of messages (and thus effectively scaling
unknown_word_strength) afterward...

Time to cogitate on which continuous asymptotic functions
might be effective at this.

>IOW, s/(s+h) gives the result that "prob is based on hamcount and spamcount"
>gives if we extrapolate our actual training data to what it would be if it
>were balanced.  If it's already balanced, the computed spamprob is the same
>whether computed by raw count or by ratio.  So if you try raw count, the
>only interesting tests would be on unbalanced training data.
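
To make the ratio-vs-count point concrete, here is a minimal Python
sketch. The function names and the example counts are made up for
illustration and are not taken from the spambayes source; the sketch
also leaves out the later adjustment involving unknown_word_strength
and assumes every count is nonzero.

```python
def spamprob_by_ratio(spamcount, hamcount, nspam, nham):
    """Hypothetical helper: prob from per-corpus ratios, s/(s+h)."""
    spamratio = spamcount / nspam   # fraction of training spam containing the word
    hamratio = hamcount / nham      # fraction of training ham containing the word
    return spamratio / (spamratio + hamratio)


def spamprob_by_count(spamcount, hamcount):
    """Hypothetical helper: prob from raw message counts only."""
    return spamcount / (spamcount + hamcount)


# Balanced training data (100 spam, 100 ham): both give about 0.75.
balanced_ratio = spamprob_by_ratio(30, 10, nspam=100, nham=100)
balanced_count = spamprob_by_count(30, 10)

# Unbalanced training data (900 spam, 100 ham): the word appears in 10%
# of each corpus, so the ratio form stays near 0.5 while the raw-count
# form gives 0.9.
skewed_ratio = spamprob_by_ratio(90, 10, nspam=900, nham=100)
skewed_count = spamprob_by_count(90, 10)
```

With balanced training data the two functions agree, so only the
unbalanced case can tell the two approaches apart.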

I'm currently testing against my real-life data, which is
between 60% and 70% spam overall (rising to about 90% spam in
recent weeks).

- Alex