[Spambayes] chi-squared versus "prob strength"
Rob Hooft
rob@hooft.net
Sun, 13 Oct 2002 13:37:40 +0200
I'm currently playing with a variant of the S/(S+H) formula: I replaced
it with (S-H+1)/2.
Some examples where this doesn't make much difference:
H     S     S/(H+S)  (S-H+1)/2
0.01  0.99  0.99     0.99       Typical spam.
0.99  0.01  0.01     0.01       Typical ham.
0.50  0.50  0.50     0.50       Typical half-way.
0.90  0.90  0.50     0.50       Looks both like ham and spam
0.10  0.10  0.50     0.50       Doesn't look like either
0.80  0.95  0.54     0.57       Both, but a bit more spam
But where it makes a difference is:
H     S     S/(H+S)  (S-H+1)/2
0.05  0.20  0.80     0.57
0.02  0.05  0.71     0.51
Here, the low S value tells you "I don't have any proof that it looks
like spam." Yet just because the H value is even lower, S/(H+S) suddenly
puts the message in, or close to, the realm of certainty. How come?
Well, we're dividing by H+S, which tells the system we're sure the
message is either ham or spam. If we're fair, however, messages with
H+S<<1 are neither ham nor spam. So maybe we should not divide by H+S at
all. Remember, the original formula was (S-H)/(S+H), which becomes
S/(S+H) when rescaled from [-1, 1] to [0, 1]. Replace it by (S-H)/1.0,
rescale the same way, and you arrive at my (S-H+1)/2, which puts
messages that are neither ham nor spam close to 0.50.
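
To make the comparison concrete, here is a minimal Python sketch of the
two combining rules applied to a few (H, S) rows from the tables above.
The function names ratio_score and diff_score are made up for this
illustration; they are not taken from the spambayes code.

```python
def ratio_score(h, s):
    """Current rule S/(S+H): normalizes as if the message must be ham or spam."""
    return s / (s + h)

def diff_score(h, s):
    """Proposed rule (S-H+1)/2: no normalization by S+H."""
    return (s - h + 1.0) / 2.0

# (H, S) pairs copied from the two tables above.
rows = [(0.01, 0.99), (0.90, 0.90), (0.80, 0.95), (0.05, 0.20), (0.02, 0.05)]
for h, s in rows:
    print("H=%.2f S=%.2f  S/(S+H)=%.3f  (S-H+1)/2=%.3f"
          % (h, s, ratio_score(h, s), diff_score(h, s)))
```

Note that ratio_score is undefined when H and S are both exactly zero,
while diff_score simply returns 0.5 in that case.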
Tim Peters wrote:
> It's been my belief that bland words are at best worthless as clues, and at
> worst actively hurt (experiment: fiddle your favorite scheme to look *only*
> at the bland words; do they have predictive power?). I think this is one of
> the schemes where they hurt, for the reason illustrated by tiny example at
> the end of my original post:
>
> """
>
> >>> from chi2 import showscore as s
> >>> s([.2, .8, .9])
> P(chisq >= 8.27033 | v= 6) = 0.218959
> P(chisq >= 3.87588 | v= 6) = 0.693468
> spam prob 0.781040515476
> ham prob 0.306531778646
> S/(S+H) 0.71815043441
(S-H+1)/2 = 0.737
>
> >>> s([.2, .8, .9] + [0.5] * 10)
> P(chisq >= 22.1333 | v= 26) = 0.681383
> P(chisq >= 17.7388 | v= 26) = 0.885068
> spam prob 0.318617174026
> ham prob 0.114932197304
> S/(S+H) 0.734904015772
(S-H+1)/2 = 0.602
Better, isn't it?
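
For anyone who wants to reproduce the numbers in this example, here is a
rough sketch of the computation. It is a reconstruction from the printed
output, not the actual chi2.showscore code. It assumes the spam
statistic is -2*sum(ln(1-p)) and the ham statistic is -2*sum(ln(p)),
each with 2n degrees of freedom, and that S and H are one minus the
chi-squared tail probability. The names chi2_tail and scores are
hypothetical.

```python
import math

def chi2_tail(x2, v):
    """P(chisq >= x2) for an even number v of degrees of freedom."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1

def scores(probs):
    """Return (S, H) for a list of per-word spam probabilities (assumed form)."""
    n = len(probs)
    spam_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    ham_stat = -2.0 * sum(math.log(p) for p in probs)
    S = 1.0 - chi2_tail(spam_stat, 2 * n)
    H = 1.0 - chi2_tail(ham_stat, 2 * n)
    return S, H

for probs in ([.2, .8, .9], [.2, .8, .9] + [0.5] * 10):
    S, H = scores(probs)
    print("S=%.6f H=%.6f  S/(S+H)=%.3f  (S-H+1)/2=%.3f"
          % (S, H, S / (S + H), (S - H + 1) / 2))
```

With ten bland 0.5 words added, both S and H drop, so their ratio barely
moves while their difference shrinks toward zero.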
Elsewhere you write:
> Lady with the obnoxious sig:
> Ignoring bland words:
> P(chisq >= 222.333 | v=136) = 4.23496e-006
> P(chisq >= 106.24 | v=136) = 0.972237
> spam prob 0.999995765045
> ham prob 0.0277633711662
> S/(S+H) 0.972986500253
(S-H+1)/2 = 0.986
> Including bland words:
>
> P(chisq >= 282.465 | v=220) = 0.00283528
> P(chisq >= 163.095 | v=220) = 0.998449
> spam prob 0.997164718534
> ham prob 0.00155126034776
> S/(S+H) 0.99844674524
(S-H+1)/2 = 0.997
The difference is smaller. This small addition of certainty could be due
to the bland words actually contributing.
> The ham whose score rose from 0.68 to 0.87:
> Ignoring bland words:
> P(chisq >= 123.422 | v=100) = 0.0560948
> P(chisq >= 97.2217 | v=100) = 0.560026
> spam prob 0.943905161882
> ham prob 0.439974054337
> S/(S+H) 0.682071925656
(S-H+1)/2 = 0.752
> Including bland words:
> P(chisq >= 174.229 | v=172) = 0.438174
> P(chisq >= 146.746 | v=172) = 0.918976
> spam prob 0.561826411084
> ham prob 0.0810237511331
> S/(S+H) 0.873961685171
(S-H+1)/2 = 0.740
Convinced? With this rule, adding the bland words no longer hurts. For
my set, with bland words, I end up with
3 spams < 0.01; 15499 hams < 0.01
4 spams < 0.10; 15766 hams < 0.10
9 hams > 0.90; 5658 spams > 0.90
3 hams > 0.99; 5392 spams > 0.99
S/(S+H) left and (S-H+1)/2 right:
cv3s -> cv5s
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
false positive percentages
0.188 0.062 won -67.02%
0.438 0.125 won -71.46%
0.125 0.062 won -50.40%
0.125 0.062 won -50.40%
0.125 0.062 won -50.40%
0.062 0.062 tied
0.250 0.188 won -24.80%
0.188 0.250 lost +32.98%
0.312 0.188 won -39.74%
0.000 0.000 tied
won 7 times
tied 2 times
lost 1 times
total unique fp went from 29 to 17 won -41.38%
mean fp % went from 0.18125 to 0.10625 won -41.38%
false negative percentages
1.034 1.207 lost +16.73%
0.345 0.517 lost +49.86%
0.345 0.862 lost +149.86%
0.517 0.862 lost +66.73%
1.207 1.207 tied
0.690 1.379 lost +99.86%
0.690 1.034 lost +49.86%
0.345 1.034 lost +199.71%
0.517 1.034 lost +100.00%
0.862 1.552 lost +80.05%
won 0 times
tied 1 times
lost 9 times
total unique fn went from 38 to 62 lost +63.16%
mean fn % went from 0.655172413793 to 1.06896551724 lost +63.16%
ham mean                    ham sdev
 0.39   0.58  +48.72%        4.46   4.94  +10.76%
 0.60   0.60   +0.00%        6.59   5.74  -12.90%
 0.45   0.60  +33.33%        4.42   4.57   +3.39%
 0.41   0.57  +39.02%        4.51   4.46   -1.11%
 0.36   0.61  +69.44%        4.06   4.63  +14.04%
 0.31   0.41  +32.26%        3.82   4.08   +6.81%
 0.52   0.66  +26.92%        5.72   5.48   -4.20%
 0.51   0.69  +35.29%        5.39   5.74   +6.49%
 0.62   0.70  +12.90%        6.13   5.71   -6.85%
 0.31   0.44  +41.94%        3.24   3.76  +16.05%
ham mean and sdev for all runs
 0.45   0.59  +31.11%        4.94   4.96   +0.40%
spam mean                   spam sdev
99.32  98.98   -0.34%        5.77   6.32   +9.53%
99.71  99.25   -0.46%        3.80   4.28  +12.63%
99.68  99.15   -0.53%        3.23   4.55  +40.87%
99.44  98.90   -0.54%        6.27   7.00  +11.64%
99.19  98.96   -0.23%        7.05   6.67   -5.39%
99.47  98.96   -0.51%        5.24   5.93  +13.17%
99.50  98.94   -0.56%        5.10   6.17  +20.98%
99.51  98.95   -0.56%        4.99   5.91  +18.44%
99.62  99.18   -0.44%        3.20   4.70  +46.88%
99.39  98.93   -0.46%        5.97   6.40   +7.20%
spam mean and sdev for all runs
99.48  99.02   -0.46%        5.21   5.86  +12.48%
ham/spam mean difference: 99.03 98.43 -0.60
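
As a sanity check on the summary lines, the mean error rates and the
relative changes can be recomputed from the unique error counts and the
test set sizes (ten runs of 1600 hams and 580 spams each). This assumes
every message is scored exactly once across the ten runs; the helper
name pct_change is made up for this sketch.

```python
def pct_change(before, after):
    """Relative change in percent; negative means the right-hand value is lower."""
    return 100.0 * (after - before) / before

n_ham = 10 * 1600   # hams scored over all ten runs
n_spam = 10 * 580   # spams scored over all ten runs

fp_before, fp_after = 29, 17   # total unique false positives, left vs right
fn_before, fn_after = 38, 62   # total unique false negatives, left vs right

print("mean fp %%: %.5f -> %.5f (%+.2f%%)"
      % (100.0 * fp_before / n_ham, 100.0 * fp_after / n_ham,
         pct_change(fp_before, fp_after)))
print("mean fn %%: %.5f -> %.5f (%+.2f%%)"
      % (100.0 * fn_before / n_spam, 100.0 * fn_after / n_spam,
         pct_change(fn_before, fn_after)))
```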
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/