[spambayes-dev] imbalance within ham or spam training sets?
Tim Peters
tim.one at comcast.net
Tue Nov 18 00:33:30 EST 2003
[Tim, quite a while ago]
>> I'm not sure we've got the best guess
>> to 17 significant digits, though <wink>. Make the imbalance wilder
>> and the by-counting spamprob gets wilder too:
>>
>> >>> h = 1./20000
>> >>> s = 1./100
>> >>> s/(h+s)
>> 0.99502487562189057
>> >>>
>>
>> That offends my intuition -- the word is so rare (2 of 20100 msgs)
>> that it's hard to believe that 99.5% is a sane guess. The Bayesian
>> adjustment knocks it down a lot based on how few times it's been
>> seen in total:
>>
>> >>> (.45*.5 + 2.0*_)/(.45 + 2.0)
>> 0.90410193928317584
>> >>>
[Kenny Pitt]
> Wow, that's interesting. I had always considered words that were
> either ham or spam, but never a little of both. In a way it makes
> sense because 1/20000 ham is so close to zero that the word should be
> considered spammy.
>
> This seems even more scary, though. Compare your last example to the
> case where the token has only been seen in 1 spam and no ham:
>
> >>> h = 0./20000
> >>> s = 1./100
> >>> s/(h+s)
> 1.0
> >>> (.45*.5 + 1.*_)/(.45 + 1.)
> 0.84482758620689669
> >>>
>
> The spam prob here is less than the case of 1 ham and 1 spam because
> of the "rare word" adjustment. So, if the token has only been seen
> once in spam and is later seen once in ham, it gets spammier? Yikes!
> If we go to h=10:
>
> >>> h = 10./20000
> >>> s = 1./100
> >>> s/(h+s)
> 0.95238095238095233
> >>> (.45*.5 + 11.*_)/(.45 + 11.)
> 0.93460178831357876
>
> And the spam prob is still going up! So whenever we have an extreme
> imbalance like this, the first n occurrences of a token added to the
> larger corpus, where n depends on the size of the imbalance, actually
> cause the probability of the *opposite* classification to *increase*.
That's an excellent analysis, and I repeated it in full so that it's easier
to find later <wink>. This is a systematically counterintuitive effect
that's inevitable when working with highly unbalanced training data. I know
Gary Robinson is thinking "about stuff like this" now, and I hope he has
time to dream up a better way to cope.
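For anyone who wants to poke at the effect, the interactive sessions quoted above can be folded into one small function. This is only a sketch of the arithmetic shown in those sessions, not the actual classifier code: the function name and the parameter names `strength` and `prior` are made up for illustration, with defaults of .45 and .5 taken from the numbers used in the examples.

```python
def spamprob(ham_count, spam_count, nham, nspam, strength=0.45, prior=0.5):
    """Adjusted spam probability for one token (illustrative sketch).

    ham_count, spam_count: number of ham/spam messages containing the token.
    nham, nspam: total number of ham/spam messages trained on.
    strength, prior: hypothetical names for the .45 and .5 constants
    that appear in the quoted sessions.
    """
    n = ham_count + spam_count
    if n == 0:
        # Token never seen: nothing to go on except the prior.
        return prior
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    # The "by-counting" estimate, s/(h+s) in the sessions.
    by_counting = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the estimate toward the prior, less so as n grows.
    return (strength * prior + n * by_counting) / (strength + n)


if __name__ == "__main__":
    # Same imbalance as in the thread: 20000 ham, 100 spam,
    # token seen in exactly 1 spam and a varying number of ham.
    for ham_count in (0, 1, 10):
        print(ham_count, spamprob(ham_count, 1, nham=20000, nspam=100))
```

With 1 spam sighting held fixed, going from 0 to 1 to 10 ham sightings reproduces the three adjusted values in the quoted sessions, each higher than the last. Widening the range of `ham_count` in the loop shows where the curve eventually turns around for a given imbalance.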
One gloss:
> I had always considered words that were either ham or spam, but never a
> little of both.
Most words are like that! If you dig thru your entire database, and ignore
hapaxes (words that appeared only once total across all training data), I
bet you'll find that few appeared only in ham or only in spam. Ours is a
"preponderance of evidence" scheme, not a "smoking gun" scheme. That's what
makes it hard to fool (no fixed word, or even collection of words, is/are
strong enough on their own to force a decision).