[Spambayes] spamprob combining

Tim Peters tim.one@comcast.net
Fri, 11 Oct 2002 19:53:45 -0400


[Gary Robinson]
> ...
> It's not the product of the p's that is a chi-square distribution, it's
> the following, given p1, p2..., pn:
>
> 2*((ln p1) + (ln p2) + ... + (ln pn))
>
> That expression has a chi-square distribution with 2n degrees of
> freedom.

I haven't found a reference to this online, and don't have reasonable access
to a good technical library, so I need your help to get this straight.

The first thing that strikes me is that it can't be quite right <wink>:  a
chi-squared statistic is positive by its very nature, but that expression is
twice a sum of logs of values in (0.0, 1.0), so is necessarily negative.
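A quick self-contained sketch of the sign problem:  the expression as quoted
is always negative, while its negation is the familiar Fisher
combined-probability statistic, which is nonnegative:

```python
import math
import random

random.seed(42)
ps = [random.random() for _ in range(10)]

# 2 * (sum of logs of values in (0, 1)) is necessarily negative ...
stat = 2.0 * sum(math.log(p) for p in ps)
assert stat < 0.0

# ... so the usual Fisher statistic negates it:  -2 * sum(ln p) > 0.
assert -stat > 0.0
```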

Here's the chi-squared function I'm using:

"""
import math as _math

def chi2Q(x2, v, exp=_math.exp):
    """Return prob(chisq >= x2, with v degrees of freedom).

    v must be even.
    """

    assert v & 1 == 0
    m = x2 / 2.0
    sum = term = exp(-m)
    for i in range(1, v//2):
        term *= m / i
        sum += term
    return sum
"""

It's an especially simple and numerically stable calculation when v is even,
and v is always even if the formulation is right.  I understand I could save
time with tabulated values, but speeding this up is premature.

Example:

>>> chi2Q(129.561, 100)
0.025000686582048785
>>>

Abramowitz & Stegun give 129.561 as the 0.025 point for the chi-squared
distribution with 100 degrees of freedom.  Etc.  I'm confident *that*
function is working correctly.
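Two degenerate cases give cheap extra checks (a sketch, restating chi2Q so
it runs standalone):  with v=2 the survival function is exactly exp(-x/2),
and chi2Q(0, v) must be 1 for any even v.

```python
import math

def chi2Q(x2, v, exp=math.exp):
    """Return prob(chisq >= x2, with v degrees of freedom); v must be even."""
    assert v & 1 == 0
    m = x2 / 2.0
    total = term = exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return total

# With 2 degrees of freedom, prob(chisq >= x) is exactly exp(-x/2).
assert abs(chi2Q(2.0, 2) - math.exp(-1.0)) < 1e-12

# chi2Q(0, v) is 1 for any even v:  a chi-squared variable is always >= 0.
assert abs(chi2Q(0.0, 100) - 1.0) < 1e-12
```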

> So you feed THAT into the inverse-chi square function to get a p-value.

I'm feeding its negation in instead, since the correct result would be 1.0
for any negative x2 input.

> Let invchi(x, f), where x is the random variable and f is the degrees of
> freedom, be the inverse chi square function. Let S be a number near 1 when
> the email looks spammy and H be a number near 1 when the email
> looks hammy.
>
> Then you want
>
> S = 1 - invchi(2*((ln (1-p1)) + (ln (1-p2)) + ... + (ln (1-pn))), 2*n)
>
> and
>
> H = 1 - invchi(2*((ln p1) + (ln p2) + ... + (ln pn)), 2*n)

OK, I believe I'm doing that, but multiplying the first argument by -2
instead of 2:

"""
from Histogram import Hist
from random import random
import sys

h = Hist(20, lo=0.0, hi=1.0)

def judge(ps, ln=_math.log):
    H = S = 0.0
    for p in ps:
        S += ln(1.0 - p)
        H += ln(p)
    n = len(ps)
    S = 1.0 - chi2Q(-2.0 * S, 2*n)
    H = 1.0 - chi2Q(-2.0 * H, 2*n)
    return S/(S+H)

warp = 0
if len(sys.argv) > 1:
    warp = int(sys.argv[1])

for i in range(5000):
    ps = [random() for j in range(50)]
    p = judge(ps + [0.99] * warp)
    h.add(p)

print "Result for random vectors of 50 probs, +", warp, "forced to 0.99"
print
h.display()
"""

Note:  as usual, scaling (S-H)/(S+H) from [-1, 1] into [0, 1] is

    ((S-H)/(S+H) + 1)/2 =
    ((S-H+S+H)/(S+H))/2 =
    (2*S/(S+H))/2 =
    S/(S+H)
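Spot-checking that algebra on a few sample pairs (a trivial sketch):

```python
# The identity ((S-H)/(S+H) + 1)/2 == S/(S+H), for S, H in (0, 1).
for S, H in [(0.25, 0.75), (0.9, 0.1), (0.5, 0.5)]:
    lhs = ((S - H) / (S + H) + 1.0) / 2.0
    rhs = S / (S + H)
    assert abs(lhs - rhs) < 1e-12
```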

The bad(?) news is that, on random inputs, this is all over the map:

Result for random vectors of 50 probs, + 0 forced to 0.99

5000 items; mean 0.50; sdev 0.27
-> <stat> min 0.000219435; median 0.49027; max 0.999817
* = 6 items
0.00 206 ***********************************
0.05 215 ************************************
0.10 209 ***********************************
0.15 240 ****************************************
0.20 282 ***********************************************
0.25 239 ****************************************
0.30 270 *********************************************
0.35 289 *************************************************
0.40 276 **********************************************
0.45 325 *******************************************************
0.50 291 *************************************************
0.55 300 **************************************************
0.60 278 ***********************************************
0.65 267 *********************************************
0.70 234 ***************************************
0.75 234 ***************************************
0.80 211 ************************************
0.85 207 ***********************************
0.90 213 ************************************
0.95 214 ************************************

I don't think it's uniformly distributed (across many runs, the small
peakedness near the midpoint persists), but it's close.
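For what it's worth, each of S and H taken alone should be exactly uniform
on random inputs:  if the p's are independent uniforms, -2*sum(ln p) is
chi-squared with 2n degrees of freedom (Fisher's method), so chi2Q of it is
a uniform p-value.  A quick simulation agrees (a sketch, restating chi2Q;
whether S/(S+H) inherits the uniformity is the open question):

```python
import math
import random

def chi2Q(x2, v, exp=math.exp):
    """Return prob(chisq >= x2, with v degrees of freedom); v must be even."""
    assert v & 1 == 0
    m = x2 / 2.0
    total = term = exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return total

random.seed(1)
n = 50
vals = []
for _ in range(2000):
    ps = [random.random() for _ in range(n)]
    # -2 * sum(ln p) is chi-squared with 2n df when the p's are uniform,
    # so this p-value should itself be uniform on (0, 1).
    vals.append(chi2Q(-2.0 * sum(math.log(p) for p in ps), 2 * n))

# The mean of a uniform sample should sit very close to 0.5.
mean = sum(vals) / len(vals)
assert 0.45 < mean < 0.55
```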

The better news is that it's indeed very sensitive to bias (perhaps that's
*why* all-random data scores all over the map?):

Result for random vectors of 50 probs, + 1 forced to 0.99

5000 items; mean 0.59; sdev 0.24
-> <stat> min 0.00175781; median 0.596673; max 0.999818
* = 7 items
0.00  49 *******
0.05  94 **************
0.10 119 *****************
0.15 109 ****************
0.20 153 **********************
0.25 185 ***************************
0.30 214 *******************************
0.35 253 *************************************
0.40 286 *****************************************
0.45 338 *************************************************
0.50 344 **************************************************
0.55 381 *******************************************************
0.60 369 *****************************************************
0.65 340 *************************************************
0.70 325 ***********************************************
0.75 315 *********************************************
0.80 292 ******************************************
0.85 285 *****************************************
0.90 255 *************************************
0.95 294 ******************************************

Result for random vectors of 50 probs, + 2 forced to 0.99

5000 items; mean 0.66; sdev 0.21
-> <stat> min 0.0214171; median 0.667916; max 0.9999
* = 7 items
0.00   4 *
0.05  17 ***
0.10  45 *******
0.15  50 ********
0.20  58 *********
0.25 103 ***************
0.30 137 ********************
0.35 181 **************************
0.40 259 *************************************
0.45 324 ***********************************************
0.50 372 ******************************************************
0.55 377 ******************************************************
0.60 412 ***********************************************************
0.65 427 *************************************************************
0.70 345 **************************************************
0.75 369 *****************************************************
0.80 379 *******************************************************
0.85 370 *****************************************************
0.90 376 ******************************************************
0.95 395 *********************************************************

Result for random vectors of 50 probs, + 10 forced to 0.99

5000 items; mean 0.88; sdev 0.13
-> <stat> min 0.494068; median 0.922177; max 1
* = 33 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    0
0.40    0
0.45    1 *
0.50   84 ***
0.55  139 *****
0.60  172 ******
0.65  257 ********
0.70  282 *********
0.75  326 **********
0.80  369 ************
0.85  560 *****************
0.90  800 *************************
0.95 2010 *************************************************************

Result for random vectors of 50 probs, + 20 forced to 0.99

5000 items; mean 0.97; sdev 0.06
-> <stat> min 0.543929; median 0.996147; max 1
* = 69 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    0
0.40    0
0.45    0
0.50    1 *
0.55    4 *
0.60   12 *
0.65   25 *
0.70   32 *
0.75   48 *
0.80  128 **
0.85  203 ***
0.90  383 ******
0.95 4164 *************************************************************

I won't bother to show it here, but the histograms are essentially mirror
images if I force the bias value to 0.01 instead of to 0.99.
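That mirror symmetry is exact:  replacing every p with 1-p swaps the S and
H log-sums, so the final score reflects around 0.5.  A self-contained sketch
(restating chi2Q and judge from above):

```python
import math
import random

def chi2Q(x2, v, exp=math.exp):
    """Return prob(chisq >= x2, with v degrees of freedom); v must be even."""
    assert v & 1 == 0
    m = x2 / 2.0
    total = term = exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return total

def judge(ps, ln=math.log):
    H = S = 0.0
    for p in ps:
        S += ln(1.0 - p)
        H += ln(p)
    n = len(ps)
    S = 1.0 - chi2Q(-2.0 * S, 2 * n)
    H = 1.0 - chi2Q(-2.0 * H, 2 * n)
    return S / (S + H)

random.seed(3)
ps = [random.random() for _ in range(50)]

# Flipping every probability p -> 1-p swaps the roles of S and H,
# so the two scores sum to exactly 1 (up to float rounding).
assert abs(judge(ps) + judge([1.0 - p for p in ps]) - 1.0) < 1e-9
```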

Does the near-uniform spread of "scores" on wholly random inputs strike you
as sane or insane?  I don't understand the theoretical underpinnings of this
test, so can't really guess.  It did surprise me -- I guess I expected
strong clustering at 0.5.