[Spambayes] chi-squared versus "prob strength"

Sun, 13 Oct 2002 13:37:40 +0200

I'm currently playing with a variant of the S/(S+H) formula: I replaced 
it with (S-H+1)/2.

Some examples where this doesn't make much difference:

    H        S    S/(H+S)     (S-H+1)/2
   0.01     0.99   0.99         0.99     Typical spam.
   0.99     0.01   0.01         0.01     Typical ham.
   0.50     0.50   0.50         0.50     Typical half-way.
   0.90     0.90   0.50         0.50     Looks both like ham and spam
   0.10     0.10   0.50         0.50     Doesn't look like either
   0.80     0.95   0.54         0.57     Both, but a bit more spam

But where it makes a difference is:

    H        S    S/(H+S)     (S-H+1)/2
   0.05     0.20   0.80         0.57
   0.02     0.05   0.71         0.51

Here, the low S value tells you "I don't have any proof that it looks 
like spam." Just because the H value is even lower, S/(H+S) suddenly puts 
the message in or close to the realm of certainty. How come? Well, 
we're dividing by H+S, which tells the system we're sure it is either 
ham or spam. If we're fair, however, these messages with H+S<<1 are 
neither ham nor spam. So maybe we should not divide by H+S at all? 
Remember, the original formula was (S-H)/(S+H), which rescaled from 
[-1,1] to [0,1] gives ((S-H)/(S+H)+1)/2 = S/(S+H). Replace (S-H)/(S+H) 
by (S-H)/1.0 and you arrive at my (S-H+1)/2, which puts messages that 
are neither ham nor spam close to 0.50.
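
To make the comparison easy to replay, here is a minimal sketch of the two 
combining rules applied to some of the (H, S) pairs from the tables above. 
The function names are made up for this illustration and are not taken 
from the spambayes code.

```python
def ratio_score(h, s):
    """Current rule S/(S+H); undefined when h + s == 0."""
    return s / (s + h)


def difference_score(h, s):
    """Proposed rule (S-H+1)/2: no division by S+H."""
    return (s - h + 1.0) / 2.0


# (H, S) pairs taken from the two tables above.
pairs = [(0.01, 0.99), (0.80, 0.95), (0.05, 0.20), (0.02, 0.05)]

for h, s in pairs:
    print(f"H={h:.2f} S={s:.2f}  "
          f"S/(S+H)={ratio_score(h, s):.3f}  "
          f"(S-H+1)/2={difference_score(h, s):.3f}")
```

The two rules agree whenever H+S is 1, and drift apart as H+S moves away 
from 1, which is exactly the case in the last two pairs.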

Tim Peters wrote:

> It's been my belief that bland words are at best worthless as clues, and at
> worst actively hurt (experiment:  fiddle your favorite scheme to look *only*
> at the bland words; do they have predictive power?).  I think this is one of
> the schemes where they hurt, for the reason illustrated by tiny example at
> the end of my original post:
> 
> """
> 
> >>> from chi2 import showscore as s
> >>> s([.2, .8, .9])
> P(chisq >=    8.27033 | v=  6) =   0.218959
> P(chisq >=    3.87588 | v=  6) =   0.693468
> spam prob 0.781040515476
>  ham prob 0.306531778646
>   S/(S+H) 0.71815043441

  (S-H+1)/2 = 0.737

> 
> >>> s([.2, .8, .9] + [0.5] * 10)
> P(chisq >=    22.1333 | v= 26) =   0.681383
> P(chisq >=    17.7388 | v= 26) =   0.885068
> spam prob 0.318617174026
>  ham prob 0.114932197304
>   S/(S+H) 0.734904015772

  (S-H+1)/2 = 0.602

Better, isn't it?
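
For anyone who wants to replay the tiny example, the following is a rough 
sketch that seems to reproduce the quoted numbers. It assumes that S and H 
are 1 - P(chisq >= x2) with 2n degrees of freedom, where x2 is 
-2*sum(ln(1-p)) for S and -2*sum(ln(p)) for H. This is a standalone 
reimplementation with made-up names (chi2_tail_even, combine), not the 
actual chi2 module.

```python
import math


def chi2_tail_even(x2, v):
    """P(chisq >= x2) for an even number of degrees of freedom v."""
    assert v % 2 == 0, "closed form below only holds for even v"
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1


def combine(probs):
    """Return (S, H) for a list of per-word spam probabilities."""
    n = len(probs)
    x2_spam = -2.0 * sum(math.log(1.0 - p) for p in probs)
    x2_ham = -2.0 * sum(math.log(p) for p in probs)
    s = 1.0 - chi2_tail_even(x2_spam, 2 * n)
    h = 1.0 - chi2_tail_even(x2_ham, 2 * n)
    return s, h


for probs in ([.2, .8, .9], [.2, .8, .9] + [0.5] * 10):
    s, h = combine(probs)
    print(f"S={s:.6f} H={h:.6f}  "
          f"S/(S+H)={s / (s + h):.3f}  (S-H+1)/2={(s - h + 1) / 2:.3f}")
```

Adding the ten 0.5 words pulls both S and H down, so S/(S+H) barely moves 
while (S-H+1)/2 drops toward 0.5.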

Elsewhere you write:

 > Lady with the obnoxious sig:

 > Ignoring bland words:

 > P(chisq >=    222.333 | v=136) = 4.23496e-006
 > P(chisq >=     106.24 | v=136) =   0.972237
 > spam prob 0.999995765045
 >  ham prob 0.0277633711662
 >   S/(S+H) 0.972986500253

  (S-H+1)/2 = 0.986

 > Including bland words:
 >
 > P(chisq >=    282.465 | v=220) = 0.00283528
 > P(chisq >=    163.095 | v=220) =   0.998449
 > spam prob 0.997164718534
 >  ham prob 0.00155126034776
 >   S/(S+H) 0.99844674524

  (S-H+1)/2 = 0.998

The difference is smaller. This small addition of certainty could be due 
to the bland words actually contributing.

 > The ham whose score rose from 0.68 to 0.87:

 > Ignoring bland words:

 > P(chisq >=    123.422 | v=100) =  0.0560948
 > P(chisq >=    97.2217 | v=100) =   0.560026
 > spam prob 0.943905161882
 >  ham prob 0.439974054337
 >   S/(S+H) 0.682071925656

  (S-H+1)/2 = 0.752

 > Including bland words:

 > P(chisq >=    174.229 | v=172) =   0.438174
 > P(chisq >=    146.746 | v=172) =   0.918976
 > spam prob 0.561826411084
 >  ham prob 0.0810237511331
 >   S/(S+H) 0.873961685171

  (S-H+1)/2 = 0.740

Convinced? With this rule, it no longer hurts to add the bland 
words. For my set, with bland words, I end up with
    3 spams < 0.01; 15499 hams  < 0.01
    4 spams < 0.10; 15766 hams  < 0.10
    9 hams  > 0.90;  5658 spams > 0.90
    3 hams  > 0.99;  5392 spams > 0.99

S/(S+H) left and (S-H+1)/2 right:

cv3s -> cv5s
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams

false positive percentages
     0.188  0.062  won    -67.02%
     0.438  0.125  won    -71.46%
     0.125  0.062  won    -50.40%
     0.125  0.062  won    -50.40%
     0.125  0.062  won    -50.40%
     0.062  0.062  tied
     0.250  0.188  won    -24.80%
     0.188  0.250  lost   +32.98%
     0.312  0.188  won    -39.74%
     0.000  0.000  tied

won   7 times
tied  2 times
lost  1 times

total unique fp went from 29 to 17 won    -41.38%
mean fp % went from 0.18125 to 0.10625 won    -41.38%

false negative percentages
     1.034  1.207  lost   +16.73%
     0.345  0.517  lost   +49.86%
     0.345  0.862  lost  +149.86%
     0.517  0.862  lost   +66.73%
     1.207  1.207  tied
     0.690  1.379  lost   +99.86%
     0.690  1.034  lost   +49.86%
     0.345  1.034  lost  +199.71%
     0.517  1.034  lost  +100.00%
     0.862  1.552  lost   +80.05%

won   0 times
tied  1 times
lost  9 times

total unique fn went from 38 to 62 lost   +63.16%
mean fn % went from 0.655172413793 to 1.06896551724 lost   +63.16%
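
As a sanity check on the summary lines, the sketch below recomputes the 
mean error percentages and the relative changes from the raw counts. It 
assumes the ten runs of 1600 hams and 580 spams add up to 16000 hams and 
5800 spams in total; the helper names are made up for this illustration.

```python
TOTAL_HAMS = 10 * 1600   # assumed: ten test folds of 1600 hams each
TOTAL_SPAMS = 10 * 580   # assumed: ten test folds of 580 spams each


def pct(count, total):
    """Error count as a percentage of the messages tested."""
    return 100.0 * count / total


def rel_change(before, after):
    """Relative change in percent; negative means fewer errors."""
    return 100.0 * (after - before) / before


# False positives: 29 with S/(S+H), 17 with (S-H+1)/2.
print(f"fp % {pct(29, TOTAL_HAMS):.5f} -> {pct(17, TOTAL_HAMS):.5f} "
      f"({rel_change(29, 17):+.2f}%)")

# False negatives: 38 with S/(S+H), 62 with (S-H+1)/2.
print(f"fn % {pct(38, TOTAL_SPAMS):.5f} -> {pct(62, TOTAL_SPAMS):.5f} "
      f"({rel_change(38, 62):+.2f}%)")
```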

ham mean                     ham sdev
    0.39    0.58  +48.72%        4.46    4.94  +10.76%
    0.60    0.60   +0.00%        6.59    5.74  -12.90%
    0.45    0.60  +33.33%        4.42    4.57   +3.39%
    0.41    0.57  +39.02%        4.51    4.46   -1.11%
    0.36    0.61  +69.44%        4.06    4.63  +14.04%
    0.31    0.41  +32.26%        3.82    4.08   +6.81%
    0.52    0.66  +26.92%        5.72    5.48   -4.20%
    0.51    0.69  +35.29%        5.39    5.74   +6.49%
    0.62    0.70  +12.90%        6.13    5.71   -6.85%
    0.31    0.44  +41.94%        3.24    3.76  +16.05%

ham mean and sdev for all runs
    0.45    0.59  +31.11%        4.94    4.96   +0.40%

spam mean                    spam sdev
   99.32   98.98   -0.34%        5.77    6.32   +9.53%
   99.71   99.25   -0.46%        3.80    4.28  +12.63%
   99.68   99.15   -0.53%        3.23    4.55  +40.87%
   99.44   98.90   -0.54%        6.27    7.00  +11.64%
   99.19   98.96   -0.23%        7.05    6.67   -5.39%
   99.47   98.96   -0.51%        5.24    5.93  +13.17%
   99.50   98.94   -0.56%        5.10    6.17  +20.98%
   99.51   98.95   -0.56%        4.99    5.91  +18.44%
   99.62   99.18   -0.44%        3.20    4.70  +46.88%
   99.39   98.93   -0.46%        5.97    6.40   +7.20%

spam mean and sdev for all runs
   99.48   99.02   -0.46%        5.21    5.86  +12.48%

ham/spam mean difference: 99.03 98.43 -0.60

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/