[Spambayes] chi-squared versus "prob strength"

Tim Peters tim.one@comcast.net
Sun, 13 Oct 2002 21:05:28 -0400


[Rob Hooft]
> I'm playing currently with a variant on the S/(S+H) formula. I replaced
> it with (S-H+1)/2

[and then shows specific examples where this gives intuitively
 more-sensible endcase results than the current rule]

> ...
> Better, isn't it?
> ...
> Convinced?

I was, but more importantly my test data agreed, so I'm going to switch to
this (the evidence is so consistent and solid on both our datasets that
making it an option would supply a pointless choice -- losers are killed).
Good show!
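
To make the endcase difference concrete, here is a minimal sketch of the two
rules side by side.  The function names and the (S, H) pairs are made up for
illustration; they are not Rob's examples and not the classifier's actual
code.

```python
def old_rule(S, H):
    """Previous combining rule: S / (S + H)."""
    return S / (S + H)

def new_rule(S, H):
    """Rob Hooft's variant: (S - H + 1) / 2."""
    return (S - H + 1.0) / 2.0

# Hypothetical (S, H) pairs, chosen only to show the endcases.
cases = [
    (0.99, 0.01),    # strong spam indicator, weak ham indicator
    (0.01, 0.99),    # the reverse
    (0.99, 0.99),    # both indicators strong
    (0.01, 0.0001),  # both indicators weak, S merely larger than H
]

for S, H in cases:
    print(f"S={S:<7} H={H:<7} "
          f"old={old_rule(S, H):.4f} new={new_rule(S, H):.4f}")
```

The last pair is the interesting one:  the ratio rule rates it about 0.99
just because S is 100 times H, while the difference rule leaves it near 0.5.
The new rule also has no division, so S = H = 0 needs no special case.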

S/(S+H) before, (S-H+1)/2 after (both runs use all defaults except that
use_chi_squared_combining is enabled):

ham mean                     ham sdev
   0.39    0.29  -25.64%        3.47    2.98  -14.12%
   0.33    0.24  -27.27%        3.13    2.66  -15.02%
   0.40    0.31  -22.50%        3.54    3.23   -8.76%
   0.23    0.16  -30.43%        2.24    1.78  -20.54%
   0.47    0.39  -17.02%        4.38    4.06   -7.31%
   0.31    0.24  -22.58%        3.05    2.73  -10.49%
   0.38    0.28  -26.32%        3.23    2.71  -16.10%
   0.29    0.21  -27.59%        2.80    2.35  -16.07%
   0.30    0.23  -23.33%        2.90    2.51  -13.45%
   0.55    0.43  -21.82%        4.45    4.08   -8.31%

ham mean and sdev for all runs
   0.36    0.28  -22.22%        3.38    2.99  -11.54%

spam mean                    spam sdev
  99.93   99.95   +0.02%        1.25    1.01  -19.20%
  99.94   99.96   +0.02%        1.24    1.11  -10.48%
  99.98   99.99   +0.01%        0.34    0.19  -44.12%
  99.92   99.93   +0.01%        1.84    1.93   +4.89%
  99.93   99.94   +0.01%        1.72    1.59   -7.56%
  99.88   99.90   +0.02%        1.95    1.72  -11.79%
  99.86   99.88   +0.02%        2.22    2.27   +2.25%
  99.91   99.94   +0.03%        1.26    0.83  -34.13%
  99.90   99.92   +0.02%        1.75    1.55  -11.43%
  99.96   99.97   +0.01%        0.73    0.43  -41.10%

spam mean and sdev for all runs
  99.92   99.94   +0.02%        1.53    1.41   -7.84%

ham/spam mean difference: 99.56 99.66 +0.10

So it's even more extreme this way, but not in a way that hurts:  the weird
msgs in "the middle ground" are even more reliably *in* the middle ground
now.  For example, in my data, conference announcements, and the very
difficult but rare long & chatty spam, almost always end up scoring near 0.5
now.  But the regions of "extreme certainty" contain more msgs at the same
time:

HAM BEFORE

-> <stat> Ham scores for all runs: 20000 items; mean 0.36; sdev 3.38
-> <stat> min -1.9984e-013; median 1.18333e-010; max 100
* = 319 items
 0.0 19401 *************************************************************
 0.5    97 *

HAM AFTER

-> <stat> Ham scores for all runs: 20000 items; mean 0.28; sdev 2.99
-> <stat> min -9.99201e-014; median 6.28553e-011; max 100
* = 320 items
 0.0 19492 *************************************************************
 0.5   104 *

Median, mean and sdev all decreased, and about 90 more hams (19492 vs 19401)
landed in the lowest bucket, i.e. scored below 0.5 on this 0-100 scale.

SPAM BEFORE

-> <stat> Spam scores for all runs: 14000 items; mean 99.92; sdev 1.53
-> <stat> min 35.983; median 100; max 100
* = 228 items
99.0    15 *
99.5 13906 *************************************************************

SPAM AFTER

-> <stat> Spam scores for all runs: 14000 items; mean 99.94; sdev 1.41
-> <stat> min 29.6176; median 100; max 100
* = 229 items
99.0    13 *
99.5 13918 *************************************************************

The effects are milder here, but still in the right direction:  the mean
rose, the sdev fell, and 12 more spams landed in the top bucket.

The "BlackIntrepid" spam is the min-scoring spam in both cases:

prob('*H*') = 0.930885
prob('*S*') = 0.523237

Chop that up any way you want, it's always going to look more like ham than
spam, and it does look a lot like legit c.l.py traffic.
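
As a quick check, plugging those two values into both rules reproduces the
min spam scores in the BEFORE and AFTER listings (35.983 and 29.6176).  This
is just arithmetic on the numbers quoted above, with the 0-100 scaling
assumed to be a plain multiply by 100:

```python
H = 0.930885   # prob('*H*') for the BlackIntrepid spam
S = 0.523237   # prob('*S*') for the same message

# Scores in the histograms are on a 0-100 scale.
before = 100.0 * S / (S + H)          # old rule: S/(S+H)
after = 100.0 * (S - H + 1.0) / 2.0   # new rule: (S-H+1)/2

print(f"before: {before:.4f}")
print(f"after:  {after:.4f}")
```

Either way the score lands below 50, because H is much larger than S.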

cvcost doesn't find much bottom-line difference:

chisq.txt: Optimal cost is $27.2 with grey zone between 50.0 and 74.0
chisq_altsh.txt: Optimal cost is $27.0 with grey zone between 50.0 and 78.0

Given that I have two false positives that are never going to go away, and
they're charged $10 each, the cost of both methods for 34,000 msgs is
trivial.
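
For a sense of scale, here is a rough sketch of that kind of cost
accounting.  Only the $10 per false positive comes from the numbers above;
the function name and the false-negative and unsure weights are assumptions
for illustration, not necessarily what cvcost uses.

```python
def total_cost(n_fp, n_fn, n_unsure,
               fp_cost=10.0,      # from the text: $10 per false positive
               fn_cost=1.0,       # assumed weight for a false negative
               unsure_cost=0.2):  # assumed weight for a grey-zone message
    """Total dollar cost of one classification run (hypothetical model)."""
    return n_fp * fp_cost + n_fn * fn_cost + n_unsure * unsure_cost

# The two persistent false positives alone:
fixed = total_cost(n_fp=2, n_fn=0, n_unsure=0)
print(fixed)

# Reported optimal costs spread over all 34,000 messages:
for name, cost in [("chisq.txt", 27.2), ("chisq_altsh.txt", 27.0)]:
    print(name, cost / 34000)
```

So $20 of the roughly $27 is fixed by those two messages regardless of the
rule, and the total works out to less than a tenth of a cent per message
either way.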