[Spambayes] spamprob combining
Tim Peters
tim.one@comcast.net
Fri, 11 Oct 2002 19:53:45 -0400
[Gary Robinson]
> ...
> It's not the product of the p's that is a chi-square distribution, it's
> the following, given p1, p2..., pn:
>
> 2*((ln p1) + (ln p2) + ... + (ln pn))
>
> That expression has a chi-square distribution with 2n degrees of
> freedom.
I haven't found a reference to this online, and don't have reasonable access
to a good technical library, so I need your help to get this straight.
The first thing that strikes me is that it can't be quite right <wink>: a
chi-squared statistic is positive by its very nature, but the expression
above is a sum of logs of values in (0., 1.), so is necessarily negative.
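For what it's worth, with the sign flipped this is the classical Fisher
combining statistic: when the p's are drawn uniformly from (0, 1),
-2*(ln p1 + ... + ln pn) follows a chi-squared distribution with 2n degrees
of freedom, so its mean should come out near 2n. A quick simulation (my
sketch, not part of the original exchange) supports that:

```python
# Empirical check of Fisher's combining statistic: for p_i ~ Uniform(0, 1),
# -2 * sum(ln p_i) over n values is chi-squared with 2n degrees of freedom,
# whose mean is 2n.  With n == 50 the mean should hover near 100.
import math
import random

random.seed(42)
n = 50           # number of p's combined per trial
trials = 10000

total = 0.0
for _ in range(trials):
    total += -2.0 * sum(math.log(random.random()) for _ in range(n))
mean = total / trials

print(mean)      # should be close to 2*n == 100
```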
Here's the chi-squared function I'm using:
"""
import math as _math

def chi2Q(x2, v, exp=_math.exp):
    """Return prob(chisq >= x2, with v degrees of freedom).

    v must be even.
    """
    assert v & 1 == 0
    m = x2 / 2.0
    sum = term = exp(-m)
    for i in range(1, v//2):
        term *= m / i
        sum += term
    return sum
"""
It's an especially simple and numerically stable calculation when v is even,
and v always is even if the formulation is right. I understand I could save
time with tabulated values, but speeding this up now would be premature.
Example:
>>> chi2Q(129.561, 100)
0.025000686582048785
>>>
Abramowitz & Stegun give 129.561 as the 0.025 point for the chi-squared
distribution with 100 degrees of freedom. Etc. I'm confident *that*
function is working correctly.
> So you feed THAT into the inverse-chi square function to get a p-value.
I'm feeding its negation in instead, since the correct result would be 1.0
for any negative x2 input.
> Let invchi(x, f), where x is the random variable and f is the degrees of
> freedom, be the inverse chi square function. Let S be a number near 1 when
> the email looks spammy and H be a number near 1 when the email
> looks hammy.
>
> Then you want
>
> S = 1 - invchi(2*((ln (1-p1)) + (ln (1-p2)) + ... + (ln (1-pn))), 2*n)
>
> and
>
> H = 1 - invchi(2*((ln p1) + (ln p2) + ... + (ln pn)), 2*n)
OK, I believe I'm doing that, but multiplying the first argument by -2
instead of 2:
"""
from Histogram import Hist
from random import random
import sys
h = Hist(20, lo=0.0, hi=1.0)
def judge(ps, ln=_math.log):
H = S = 0.0
for p in ps:
S += ln(1.0 - p)
H += ln(p)
n = len(ps)
S = 1.0 - chi2Q(-2.0 * S, 2*n)
H = 1.0 - chi2Q(-2.0 * H, 2*n)
return S/(S+H)
warp = 0
if len(sys.argv) > 1:
warp = int(sys.argv[1])
for i in range(5000):
ps = [random() for j in range(50)]
p = judge(ps + [0.99] * warp)
h.add(p)
print "Result for random vectors of 50 probs, +", warp, "forced to 0.99"
print
h.display()
"""
Note: as usual, scaling (S-H)/(S+H) from [-1, 1] into [0, 1] is
((S-H)/(S+H) + 1)/2 =
((S-H+S+H)/(S+H))/2 =
(2*S/(S+H))/2 =
S/(S+H)
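The identity is easy to confirm numerically (my addition; the (S, H) pairs
below are arbitrary illustrative values):

```python
# Spot check of the scaling identity above: ((S-H)/(S+H) + 1)/2 == S/(S+H)
# for arbitrary positive S and H.
for S, H in [(0.9, 0.1), (0.3, 0.7), (0.5, 0.5), (0.123, 0.877)]:
    scaled = ((S - H) / (S + H) + 1.0) / 2.0
    assert abs(scaled - S / (S + H)) < 1e-12
print("identity holds")
```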
The bad(?) news is that, on random inputs, this is all over the map:
Result for random vectors of 50 probs, + 0 forced to 0.99
5000 items; mean 0.50; sdev 0.27
-> <stat> min 0.000219435; median 0.49027; max 0.999817
* = 6 items
0.00 206 ***********************************
0.05 215 ************************************
0.10 209 ***********************************
0.15 240 ****************************************
0.20 282 ***********************************************
0.25 239 ****************************************
0.30 270 *********************************************
0.35 289 *************************************************
0.40 276 **********************************************
0.45 325 *******************************************************
0.50 291 *************************************************
0.55 300 **************************************************
0.60 278 ***********************************************
0.65 267 *********************************************
0.70 234 ***************************************
0.75 234 ***************************************
0.80 211 ************************************
0.85 207 ***********************************
0.90 213 ************************************
0.95 214 ************************************
I don't think it's uniformly distributed (across many runs, the small
peakedness near the midpoint persists), but it's close.
The better news is that it's indeed very sensitive to bias (perhaps that's
*why* all-random data scores all over the map?):
Result for random vectors of 50 probs, + 1 forced to 0.99
5000 items; mean 0.59; sdev 0.24
-> <stat> min 0.00175781; median 0.596673; max 0.999818
* = 7 items
0.00 49 *******
0.05 94 **************
0.10 119 *****************
0.15 109 ****************
0.20 153 **********************
0.25 185 ***************************
0.30 214 *******************************
0.35 253 *************************************
0.40 286 *****************************************
0.45 338 *************************************************
0.50 344 **************************************************
0.55 381 *******************************************************
0.60 369 *****************************************************
0.65 340 *************************************************
0.70 325 ***********************************************
0.75 315 *********************************************
0.80 292 ******************************************
0.85 285 *****************************************
0.90 255 *************************************
0.95 294 ******************************************
Result for random vectors of 50 probs, + 2 forced to 0.99
5000 items; mean 0.66; sdev 0.21
-> <stat> min 0.0214171; median 0.667916; max 0.9999
* = 7 items
0.00 4 *
0.05 17 ***
0.10 45 *******
0.15 50 ********
0.20 58 *********
0.25 103 ***************
0.30 137 ********************
0.35 181 **************************
0.40 259 *************************************
0.45 324 ***********************************************
0.50 372 ******************************************************
0.55 377 ******************************************************
0.60 412 ***********************************************************
0.65 427 *************************************************************
0.70 345 **************************************************
0.75 369 *****************************************************
0.80 379 *******************************************************
0.85 370 *****************************************************
0.90 376 ******************************************************
0.95 395 *********************************************************
Result for random vectors of 50 probs, + 10 forced to 0.99
5000 items; mean 0.88; sdev 0.13
-> <stat> min 0.494068; median 0.922177; max 1
* = 33 items
0.00 0
0.05 0
0.10 0
0.15 0
0.20 0
0.25 0
0.30 0
0.35 0
0.40 0
0.45 1 *
0.50 84 ***
0.55 139 *****
0.60 172 ******
0.65 257 ********
0.70 282 *********
0.75 326 **********
0.80 369 ************
0.85 560 *****************
0.90 800 *************************
0.95 2010 *************************************************************
Result for random vectors of 50 probs, + 20 forced to 0.99
5000 items; mean 0.97; sdev 0.06
-> <stat> min 0.543929; median 0.996147; max 1
* = 69 items
0.00 0
0.05 0
0.10 0
0.15 0
0.20 0
0.25 0
0.30 0
0.35 0
0.40 0
0.45 0
0.50 1 *
0.55 4 *
0.60 12 *
0.65 25 *
0.70 32 *
0.75 48 *
0.80 128 **
0.85 203 ***
0.90 383 ******
0.95 4164 *************************************************************
I won't bother to show it here, but the histograms are essentially mirror
images if I force the bias value to 0.01 instead of to 0.99.
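The mirror-image behavior follows directly from the symmetry p -> 1-p, which
swaps the roles of S and H inside judge, so the score reflects around 0.5.
A quick check (my addition, re-stating judge so it runs standalone; the
probability vector is arbitrary):

```python
# Demonstrates that replacing every p with 1-p swaps S and H in judge,
# so the two scores sum to 1 (up to floating-point rounding).
import math

def chi2Q(x2, v):
    """Return prob(chisq >= x2, with v degrees of freedom).  v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return total

def judge(ps):
    H = S = 0.0
    for p in ps:
        S += math.log(1.0 - p)
        H += math.log(p)
    n = len(ps)
    S = 1.0 - chi2Q(-2.0 * S, 2 * n)
    H = 1.0 - chi2Q(-2.0 * H, 2 * n)
    return S / (S + H)

ps = [0.2, 0.7, 0.99, 0.4, 0.55]       # arbitrary example probabilities
mirrored = [1.0 - p for p in ps]
assert abs(judge(ps) + judge(mirrored) - 1.0) < 1e-12
print(judge(ps), judge(mirrored))
```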
Does the near-uniform spread of "scores" on wholly random inputs strike you
as sane or insane? I don't understand the theoretical underpinnings of this
test, so can't really guess. It did surprise me -- I guess I expected
strong clustering at 0.5.