[spambayes-dev] imbalance within ham or spam training sets?
Kenny Pitt
kennypitt at hotmail.com
Tue Nov 4 11:15:18 EST 2003
Tim Peters wrote:
> I'm not sure we've got the best guess
> to 17 significant digits, though <wink>. Make the imbalance wilder
> and the by-counting spamprob gets wilder too:
>
> >>> h = 1./20000
> >>> s = 1./100
> >>> s/(h+s)
> 0.99502487562189057
> >>>
>
> That offends my intuition -- the word is so rare (2 of 20100 msgs)
> that it's hard to believe that 99.5% is a sane guess. The Bayesian
> adjustment knocks it down a lot based on how few times it's been seen
> in total:
>
> >>> (.45*.5 + 2.0*_)/(.45 + 2.0)
> 0.90410193928317584
> >>>
Wow, that's interesting. I had always considered words that were either
purely ham or purely spam, but never a little of both. In a way it makes
sense, because 1 occurrence in 20000 hams is so close to zero that the
word should still be considered spammy.
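For what it's worth, here's the whole calculation wrapped up as a little
function. It's just a sketch of the hand arithmetic above, assuming
SpamBayes' default Robinson parameters (unknown-word strength 0.45,
unknown-word probability 0.5); the function name and argument defaults
are mine, purely for illustration:

    def spamprob(hamcount, spamcount, nham=20000., nspam=100.,
                 strength=0.45, unknown=0.5):
        # Illustrative sketch of the hand calculation above -- not the
        # real classifier code.  By-counting estimate first: the
        # token's normalized spam frequency as a fraction of its total
        # normalized frequency.
        h = hamcount / nham
        s = spamcount / nspam
        prob = s / (h + s)
        # Bayesian adjustment: pull the estimate back toward the
        # unknown-word probability when the token has been seen only a
        # few times in total.
        n = hamcount + spamcount
        return (strength * unknown + n * prob) / (strength + n)

spamprob(1, 1) reproduces the 0.9041... above.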
This seems even scarier, though. Compare your last example to the case
where the token has only been seen in 1 spam and no ham:
>>> h = 0./20000
>>> s = 1./100
>>> s/(h+s)
1.0
>>> (.45*.5 + 1.*_)/(.45 + 1.)
0.84482758620689669
>>>
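(With the sketch above, that's spamprob(0, 1).)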
The spam prob here is lower than in the 1-ham, 1-spam case because of
the "rare word" adjustment. So, if the token has only ever been seen
once in spam and is later seen once in ham, it gets spammier? Yikes!
If we go to h = 10:
>>> h = 10./20000
>>> s = 1./100
>>> s/(h+s)
0.95238095238095233
>>> (.45*.5 + 11.*_)/(.45 + 11.)
0.93460178831357876
>>>
And the spam prob is still going up! So whenever we have an extreme
imbalance like this, the first n occurrences of a token added to the
larger corpus, where n depends on the size of the imbalance, actually
cause the probability of the *opposite* classification to *increase*.
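To see exactly where it turns around, here's a fine-grained sweep with
the spamprob() sketch above, holding the token at 1 spam occurrence
while ham occurrences accumulate:

    for hamcount in range(12):
        print hamcount, spamprob(hamcount, 1)

With these totals (20000 ham, 100 spam trained), the adjusted prob
climbs from about 0.845 at 0 hams to a peak of about 0.942 at 6 hams
before it finally starts back down, so n is about 6 for this 200:1
imbalance.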
--
Kenny Pitt