[Spambayes] Proposing to remove 4 combining schemes

Tim Peters tim.one@comcast.net
Sat Oct 19 06:55:16 2002


[Tim, suggests that (S-H+1)/2 would be good to try with gary-combining]

[Rob]
> tim combining:
> -> <stat> Ham scores for all runs: 16000 items; mean 13.62; sdev 9.66
> -> <stat> min 0.109175; median 12.3561; max 76.0553
> -> <stat> fivepctlo 1.35543; fivepcthi 31.4327
> -> <stat> Spam scores for all runs: 5800 items; mean 84.42; sdev 11.70
> -> <stat> min 21.351; median 85.6889; max 99.8161
> -> <stat> fivepctlo 64.4615; fivepcthi 98.8117
> -> best cost for all runs: $110.40
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at ham & spam cutoffs 0.5 & 0.625
> ->     fp 5; fn 16; unsure ham 35; unsure spam 187
> ->     fp rate 0.0312%; fn rate 0.276%; unsure rate 1.02%

BTW, note that I killed this scheme off -- it was an attempt, at the time, to
get a better middle ground, but chi-combining works better for that.

> default combining:
> -> <stat> Ham scores for all runs: 16000 items; mean 26.37; sdev 8.32
> -> <stat> min 0.137212; median 27.2524; max 65.3836
> -> <stat> fivepctlo 11.7696; fivepcthi 38.3897
> -> <stat> Spam scores for all runs: 5800 items; mean 75.96; sdev 10.74
> -> <stat> min 33.8547; median 74.3976; max 99.7559
> -> <stat> fivepctlo 59.9773; fivepcthi 96.4292
> -> best cost for all runs: $106.20
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at ham & spam cutoffs 0.5 & 0.585
> ->     fp 5; fn 16; unsure ham 35; unsure spam 166
> ->     fp rate 0.0312%; fn rate 0.276%; unsure rate 0.922%
>
> default combining with P-Q instead of (P-Q)/(P+Q):
> -> <stat> Ham scores for all runs: 16000 items; mean 21.49; sdev 8.73
> -> <stat> min 0.123198; median 21.7049; max 68.8251
> -> <stat> fivepctlo 7.34536; fivepcthi 35.6937
> -> <stat> Spam scores for all runs: 5800 items; mean 79.44; sdev 11.00
> -> <stat> min 29.348; median 79.2283; max 99.786
> -> <stat> fivepctlo 61.9311; fivepcthi 97.3078
> -> best cost for all runs: $103.40
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at ham & spam cutoffs 0.5 & 0.615
> ->     fp 3; fn 16; unsure ham 37; unsure spam 250
> ->     fp rate 0.0187%; fn rate 0.276%; unsure rate 1.32%
>
> It is all so close together in the final "cost" result that it is very
> difficult to judge from the statistics.
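
As a sanity check on the "best cost" lines quoted above, the totals can be
reproduced from the reported counts.  This is only a sketch:  it assumes the
cost is just the per-item prices times the counts, and the name total_cost
is made up for illustration, not taken from the spambayes code.

```python
# Prices from the quoted test output.
PER_FP, PER_FN, PER_UNSURE = 10.00, 1.00, 0.20

def total_cost(fp, fn, unsure_ham, unsure_spam):
    """Hypothetical helper: price out one set of error counts."""
    unsure = unsure_ham + unsure_spam
    return fp * PER_FP + fn * PER_FN + unsure * PER_UNSURE

# (fp, fn, unsure ham, unsure spam) as reported for each run.
runs = {
    "tim combining": (5, 16, 35, 187),
    "default combining": (5, 16, 35, 166),
    "default with P-Q": (3, 16, 37, 250),
}
for name, counts in runs.items():
    print(f"{name}: ${total_cost(*counts):.2f}")
```

Under that assumption, the P-Q run saves $20.00 on two fewer FP but pays
$17.20 for 86 more unsures than the default run, which is the whole $2.80
difference between the two.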

Then let's take the stats at face value:  these are large runs, so if it
doesn't make a clear difference here, it's unlikely to make a clear
difference anywhere.  IIRC, you were inspired to try S-H under chi-combining
by staring at mistakes where a modest S value was paired with a very low H
value, leading to S/(S+H) approaching 1 even though S was far from certain
on its own.  But gary-combining is much less extreme in both its S and H
measures, so it's less of a *potential* problem there.  It *may* account for
the two FP that got redeemed in your last run, though -- knowing their
internal S and H values would help (oops -- they're called P and Q inside
the default scheme, but same thing).
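
To make the S/(S+H) concern concrete, here is a small Python sketch comparing
the ratio form with the (S-H+1)/2 form.  The function names and the S and H
values are invented for illustration; they are not from any of the runs
above or from the spambayes source.

```python
def ratio_score(s, h):
    """S/(S+H): the normalized form."""
    return s / (s + h)

def normalized_difference_score(s, h):
    """(1 + (S-H)/(S+H))/2, which is algebraically the same as S/(S+H)."""
    return (1.0 + (s - h) / (s + h)) / 2.0

def difference_score(s, h):
    """(S-H+1)/2: the plain difference, rescaled to lie in [0, 1]."""
    return (s - h + 1.0) / 2.0

# A modest S paired with a very low H (illustrative values only).
s, h = 0.30, 0.001

print("S/(S+H)   =", ratio_score(s, h))       # very close to 1
print("(S-H+1)/2 =", difference_score(s, h))  # stays in the middle ground

# The two normalized spellings agree, up to float rounding.
assert abs(ratio_score(s, h) - normalized_difference_score(s, h)) < 1e-12
```

With these made-up inputs the ratio form lands above 0.99 while the
difference form stays near 0.65, which is the kind of gap described above.
When S and H are both moderate the two forms give much closer answers.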