[Spambayes] Proposing to remove 4 combining schemes

Rob Hooft rob@hooft.net
Thu Oct 17 22:49:07 2002


Tim Peters wrote:
> [Rob]
>>Did you ever try tim combining with (S-H+1)/2?
> 
> 
> No, but it would be an excellent idea to try it with the current default
> combining!  tim-combining is unique in that its S is especially sensitive to
> *low*-spamprob words, and its H to high-spamprob words; when something
> really is spam, tim-combining isn't relying so much on having a high S value
> as on having a low H value, so that the ratio S/(S+H) approaches 1.
> Gary-combining is much more like chi-combining in these respects, and
> chi-combining is where the (S-H+1)/2 reformulation helped.
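
To make the difference between the two final steps concrete, here is a minimal Python sketch. It assumes S and H are already-computed spam and ham evidence values between 0 and 1; how each combining scheme arrives at them is not shown, and the function names are made up for illustration rather than taken from the Spambayes source.

```python
def ratio_score(s, h):
    """Ratio form of the final step: S/(S+H)."""
    return s / (s + h)


def difference_score(s, h):
    """Difference form of the final step: (S-H+1)/2."""
    return (s - h + 1.0) / 2.0


if __name__ == "__main__":
    # Hypothetical values: a modest S paired with a much lower H.
    s, h = 0.01, 0.0001
    print("ratio form:     ", ratio_score(s, h))
    print("difference form:", difference_score(s, h))
```

With a low S and a much lower H, the ratio form lands near 1 (roughly 0.99 here) while the difference form stays near 0.5, which matches the behaviour described in the quoted text.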

tim combining:
-> <stat> Ham scores for all runs: 16000 items; mean 13.62; sdev 9.66
-> <stat> min 0.109175; median 12.3561; max 76.0553
-> <stat> fivepctlo 1.35543; fivepcthi 31.4327
-> <stat> Spam scores for all runs: 5800 items; mean 84.42; sdev 11.70
-> <stat> min 21.351; median 85.6889; max 99.8161
-> <stat> fivepctlo 64.4615; fivepcthi 98.8117
-> best cost for all runs: $110.40
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.5 & 0.625
->     fp 5; fn 16; unsure ham 35; unsure spam 187
->     fp rate 0.0312%; fn rate 0.276%; unsure rate 1.02%

default combining:
-> <stat> Ham scores for all runs: 16000 items; mean 26.37; sdev 8.32
-> <stat> min 0.137212; median 27.2524; max 65.3836
-> <stat> fivepctlo 11.7696; fivepcthi 38.3897
-> <stat> Spam scores for all runs: 5800 items; mean 75.96; sdev 10.74
-> <stat> min 33.8547; median 74.3976; max 99.7559
-> <stat> fivepctlo 59.9773; fivepcthi 96.4292
-> best cost for all runs: $106.20
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.5 & 0.585
->     fp 5; fn 16; unsure ham 35; unsure spam 166
->     fp rate 0.0312%; fn rate 0.276%; unsure rate 0.922%

default combining with P-Q instead of (P-Q)/(P+Q):
-> <stat> Ham scores for all runs: 16000 items; mean 21.49; sdev 8.73
-> <stat> min 0.123198; median 21.7049; max 68.8251
-> <stat> fivepctlo 7.34536; fivepcthi 35.6937
-> <stat> Spam scores for all runs: 5800 items; mean 79.44; sdev 11.00
-> <stat> min 29.348; median 79.2283; max 99.786
-> <stat> fivepctlo 61.9311; fivepcthi 97.3078
-> best cost for all runs: $103.40
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.5 & 0.615
->     fp 3; fn 16; unsure ham 37; unsure spam 250
->     fp rate 0.0187%; fn rate 0.276%; unsure rate 1.32%

The final "cost" results are all so close together ($110.40, $106.20 and
$103.40) that it is very difficult to pick a winner from these statistics.
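
For reference, the "best cost" figures can be reproduced from the counts and per-item costs printed above. The following is a small sketch, not the test driver's own code; the function and variable names are invented for illustration.

```python
# Per-item costs as printed in the output above.
FP_COST = 10.00
FN_COST = 1.00
UNSURE_COST = 0.20


def total_cost(fp, fn, unsure_ham, unsure_spam):
    """Sum the cost of false positives, false negatives and unsures."""
    unsure = unsure_ham + unsure_spam
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


# (fp, fn, unsure ham, unsure spam) for each run, copied from the output.
runs = {
    "tim combining": (5, 16, 35, 187),
    "default combining": (5, 16, 35, 166),
    "default combining with P-Q": (3, 16, 37, 250),
}

if __name__ == "__main__":
    for name, counts in runs.items():
        print(f"{name}: ${total_cost(*counts):.2f}")
```

Breaking the totals down this way shows that the P-Q variant trades two fewer false positives for a noticeably larger number of unsure spam, so the small difference in total cost hides a real shift in where the errors fall.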

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/