[Spambayes] RE: chi-combining
Tim Peters
tim.one@comcast.net
Tue Nov 19 04:21:36 2002
In an offline thread with Greg Louis (who's working on bogofilter), I tried
an experiment using just the S, then just the H, components of our spamprob
calculation. We currently return (1+S-H)/2. In the results below, "justs"
returns S alone and "justh" returns 1-H alone. justs is a comparative
disaster, but the more I stare at it, the more I think justh did surprisingly
well:
filename:        base     justs     justh
ham:spam:   6000:6000 6000:6000 6000:6000
fp total:           2         8         2
fp %:            0.03      0.13      0.03
fn total:           0         0         4
fn %:            0.00      0.00      0.07
unsure t:          40        59         6
unsure %:        0.33      0.49      0.05
real cost:     $28.00    $91.80    $25.20
best cost:      $4.00    $22.40     $6.60
h mean:          0.38      0.69      0.08
h sdev:          3.53      5.81      2.18
s mean:         99.96     99.99     99.92
s sdev:          1.41      0.45      2.58
mean diff:      99.58     99.30     99.84
k:              20.16     15.86     20.97
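For anyone checking the "real cost" row by hand, here is a small sketch of the arithmetic. The weights ($10 per fp, $1 per fn, $0.20 per unsure) are not stated in this post; they are an assumption that happens to reproduce all three "real cost" figures above, and the function name real_cost is made up for illustration.

```python
def real_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.20):
    """Weighted cost of a run; the default weights are assumed, not from the post."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight

# (fp total, fn total, unsure t) taken from the table above
runs = [("base", 2, 0, 40), ("justs", 8, 0, 59), ("justh", 2, 4, 6)]
for name, fp, fn, unsure in runs:
    print(f"{name}: ${real_cost(fp, fn, unsure):.2f}")
```

Under these weights justh's four extra fn cost $4.00, while its 34 fewer unsures save $6.80, which is where its small edge over base comes from.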
Similar results were obtained from another trial on different 6K samples
from my c.l.py test data. If you hate FP a lot, and would rather suffer a
few FN in return for skipping lots of unsures, justh looks like it may be a
viable strategy. Even though H is less sensitive to high-spamprob words
than to low-spamprob words (and S the reverse), at least on this data spam
still scores very high under H.
If you want to try this, in chi2_spamprob replace

    prob = (S-H + 1.0) / 2.0

with

    prob = 1.0 - H
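To make the three variants concrete, here is a stripped-down, standalone sketch of chi-combining. It is not the actual chi2_spamprob code: it skips the underflow handling a real classifier needs, takes a plain list of per-word spamprobs (each assumed to be strictly between 0 and 1), and the names combine and mode are made up for illustration.

```python
import math

def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2), for even v > 0."""
    assert v > 0 and v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.0
    return min(total, 1.0)

def combine(probs, mode="base"):
    """Combine word spamprobs; mode is 'base', 'justs' or 'justh' (hypothetical)."""
    n = len(probs)
    if n == 0:
        return 0.5
    # S is driven by high-spamprob words, H by low-spamprob words
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    if mode == "justs":
        return S
    if mode == "justh":
        return 1.0 - H
    return (S - H + 1.0) / 2.0

spammy = [0.99] * 10
hammy = [0.01] * 10
for mode in ("base", "justs", "justh"):
    print(mode, combine(spammy, mode), combine(hammy, mode))
```

On clearly one-sided input like this, all three variants agree; they only part ways on messages with a mix of strong ham and strong spam clues, which is where the fp/fn/unsure differences in the table come from.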