[Spambayes] RE: chi-combining

Tue Nov 19 04:21:36 2002

In an offline thread with Greg Louis (who's working on bogofilter), I tried
an experiment using just the S, then just the H, component of our spamprob
calculation.  We currently return (1+S-H)/2.  The "justs" run here returns
just S, and the "justh" run returns just 1-H.  justs is a comparative
disaster, but the more I stare at it, the more I think justh did
surprisingly well:

filename:     base   justs   justh
ham:spam:  6000:6000 6000:6000 6000:6000
fp total:        2       8       2
fp %:         0.03    0.13    0.03
fn total:        0       0       4
fn %:         0.00    0.00    0.07
unsure t:       40      59       6
unsure %:     0.33    0.49    0.05
real cost:  $28.00  $91.80  $25.20
best cost:   $4.00  $22.40   $6.60
h mean:       0.38    0.69    0.08
h sdev:       3.53    5.81    2.18
s mean:      99.96   99.99   99.92
s sdev:       1.41    0.45    2.58
mean diff:   99.58   99.30   99.84
k:           20.16   15.86   20.97
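
The "real cost" row is consistent with charging $10.00 per fp, $1.00 per fn
and $0.20 per unsure.  Those weights are an assumption inferred from the
numbers above, not something stated in this message, and the names below are
made up for illustration.  A quick check of the arithmetic, done in cents to
sidestep float rounding:

```python
# Assumed weights, in cents, inferred from the table (not stated in the post).
FP_CENTS, FN_CENTS, UNSURE_CENTS = 1000, 100, 20

def real_cost_cents(fp, fn, unsure):
    """Total cost in cents for the given error and unsure counts."""
    return fp * FP_CENTS + fn * FN_CENTS + unsure * UNSURE_CENTS

# (fp total, fn total, unsure t) for each column of the table.
runs = {
    "base":  (2, 0, 40),
    "justs": (8, 0, 59),
    "justh": (2, 4, 6),
}

for name, counts in runs.items():
    cents = real_cost_cents(*counts)
    print("%-6s $%d.%02d" % (name, cents // 100, cents % 100))
```

Under those assumed weights, justh gets its lower real cost by trading 34
unsures for 4 fn.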

Similar results were obtained from another trial on different 6K samples
from my c.l.py test data.  If you hate FP a lot, and would rather suffer a
few FN in return for skipping lots of unsures, justh looks like it may be a
viable strategy.  Even though H is less sensitive to high-spamprob words
than to low-spamprob words (and S the reverse), at least on this data spam
still scores very high under H.

If you want to try this, in chi2_spamprob replace

            prob = (S-H + 1.0) / 2.0

with

            prob = 1.0 - H
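
For anyone who wants to play with the three variants outside the code base,
here is a minimal standalone sketch of chi-combining.  It is not the project
code: combine() and its mode argument are hypothetical names, the logs are
summed directly rather than guarding the products against underflow, and it
assumes every word probability lies strictly between 0 and 1.

```python
import math

def chi2Q(x2, v):
    """Return P(chi-squared >= x2) for v degrees of freedom; v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair over 1.0.
    return min(total, 1.0)

def combine(probs, mode="base"):
    """Combine per-word spamprobs; mode is 'base', 'justs' or 'justh'."""
    n = len(probs)
    if n == 0:
        return 0.5
    # Assumes 0 < p < 1 for every p, so both logs are finite.
    ln_s = sum(math.log(1.0 - p) for p in probs)
    ln_h = sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(-2.0 * ln_s, 2 * n)
    H = 1.0 - chi2Q(-2.0 * ln_h, 2 * n)
    if mode == "justs":
        return S
    if mode == "justh":
        return 1.0 - H
    return (S - H + 1.0) / 2.0

if __name__ == "__main__":
    spammy = [0.99] * 10
    mixed = [0.99] * 5 + [0.01] * 5
    for mode in ("base", "justs", "justh"):
        print(mode, combine(spammy, mode), combine(mixed, mode))
```

On the evenly mixed clues, S and H both come out close to 1, so base lands
in the middle while justs and justh head for opposite ends.  That is the
price of using one component alone: strong conflicting evidence no longer
shows up as an unsure.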