[spambayes-dev] imbalance within ham or spam training sets?

Tim Peters tim.one at comcast.net
Tue Nov 18 00:33:30 EST 2003


[Tim, quite a while ago]
>> I'm not sure we've got the best guess
>> to 17 significant digits, though <wink>.  Make the imbalance wilder
>> and the by-counting spamprob gets wilder too:
>>
>> >>> h = 1./20000
>> >>> s = 1./100
>> >>> s/(h+s)
>> 0.99502487562189057
>> >>>
>>
>> That offends my intuition -- the word is so rare (2 of 20100 msgs)
>> that it's hard to believe that 99.5% is a sane guess.  The Bayesian
>> adjustment knocks it down a lot based on how few times it's been
>> seen in total:
>>
>> >>> (.45*.5 + 2.0*_)/(.45 + 2.0)
>> 0.90410193928317584
>> >>>

[Kenny Pitt]
> Wow, that's interesting.  I had always considered words that were
> either ham or spam, but never a little of both.  In a way it makes
> sense because 1/20000 ham is so close to zero that the word should be
> considered spammy.
>
> This seems even more scary, though.  Compare your last example to the
> case where the token has only been seen in 1 spam and no ham:
>
> >>> h = 0./20000
> >>> s = 1./100
> >>> s/(h+s)
> 1.0
> >>> (.45*.5 + 1.*_)/(.45 + 1.)
> 0.84482758620689669
> >>>
>
> The spam prob here is less than the case of 1 ham and 1 spam because
> of the "rare word" adjustment.  So, if the token has only been seen
> once in spam and is later seen once in ham, it gets spammier?  Yikes!
> If we go to h=10:
>
> >>> h = 10./20000
> >>> s = 1./100
> >>> s/(h+s)
> 0.95238095238095233
> >>> (.45*.5 + 11.*_)/(.45 + 11.)
> 0.93460178831357876
>
> And the spam prob is still going up!  So whenever we have an extreme
> imbalance like this, the first n occurrences of a token added to the
> larger corpus, where n depends on the size of the imbalance, actually
> cause the probability of the *opposite* classification to *increase*.

That's an excellent analysis, and I repeated it in full so that it's easier
to find later <wink>.  This is a systematically counterintuitive effect
that's inevitable when working with highly unbalanced training data.  I know
Gary Robinson is thinking "about stuff like this" now, and I hope he has
time to dream up a better way to cope.
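
If you want to replay that arithmetic without retyping it, here's a minimal
sketch that bundles the by-counting guess and the .45/.5 adjustment from the
snippets above into one function.  The name "adjusted" and the default corpus
sizes (20000 ham, 100 spam) are made up for illustration, not how the
classifier actually stores anything:

>>> def adjusted(hamcount, spamcount, nham=20000, nspam=100):
...     # by-counting guess, then the strength/prior adjustment (s=.45, x=.5)
...     hamratio = hamcount / float(nham)
...     spamratio = spamcount / float(nspam)
...     prob = spamratio / (hamratio + spamratio)
...     n = hamcount + spamcount   # total number of times the word was seen
...     return (.45 * .5 + n * prob) / (.45 + n)
...
>>> "%.5f %.5f %.5f" % (adjusted(0, 1), adjusted(1, 1), adjusted(10, 1))
'0.84483 0.90410 0.93460'

Those are the three numbers quoted above; with these corpus sizes the adjusted
probability keeps inching up until roughly half a dozen hams have been seen,
and only then starts heading back down as the growing count lets the raw ratio
take over.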

One gloss:

> I had always considered words that were either ham or spam, but never a
> little of both.

Most words are a little of both!  If you dig thru your entire database, and
ignore hapaxes (words that appeared only once total across all training data),
I bet you'll find that few appeared only in ham or only in spam.  Ours is a
"preponderance of evidence" scheme, not a "smoking gun" scheme.  That's what
makes it hard to fool (no fixed word, or even collection of words, is strong
enough on its own to force a decision).
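
If you're curious, a tally along these lines would check it.  The wordinfo
dict here is a made-up toy stand-in for however your database hands back
per-token (hamcount, spamcount) pairs, not the real storage layout:

>>> wordinfo = {'free': (3, 40), 'python': (55, 2), 'viagra': (0, 17),
...             'wink': (9, 0), 'unsubscribe': (1, 0)}
>>> ham_only = spam_only = both = 0
>>> for hamcount, spamcount in wordinfo.values():
...     if hamcount + spamcount < 2:
...         continue              # ignore hapaxes
...     if hamcount and spamcount:
...         both += 1
...     elif hamcount:
...         ham_only += 1
...     else:
...         spam_only += 1
...
>>> both, ham_only, spam_only
(2, 1, 1)

On real data the "both" bucket should swamp the other two once the hapaxes are
dropped -- that's the preponderance of evidence at work.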



