[Spambayes] Proposing to remove 4 combining schemes

Rob W. W. Hooft rob@hooft.net
Thu Oct 17 16:25:54 2002


I wrote about the huge certainties in chi2 combining:

>>You can downscale things a bit by reducing the final S,H-score in
>>chi_squared combining before calling chi2Q. Maybe take the sqrt or
>>something similar.
> 
Tim wrote:
> 
> Not really attractive; sqrt would be far too gross a distortion, btw (e.g.,
> it would change a score of 0.5 to 0.0 -- the mean is 2*n and the sdev
> 2*sqrt(n)).

I tried it anyway. Here are some results:

Normal:
-> <stat> Ham scores for all runs: 16000 items; mean 0.59; sdev 4.96
-> <stat> min 0; median 1.36141e-11; max 100
-> <stat> fivepctlo 0; fivepcthi 0.144228
* = 253 items
  0.0 15415 *************************************************************
  0.5    84 *
  1.0    54 *
  1.5    30 *
  2.0    30 *
  2.5    17 *
  3.0    19 *
  3.5    19 *
  4.0    12 *
-> <stat> Spam scores for all runs: 5800 items; mean 99.02; sdev 5.86
-> <stat> min 6.85475e-09; median 100; max 100
-> <stat> fivepctlo 96.8278; fivepcthi 100
* = 87 items
95.5   46 *
96.0   17 *
96.5   14 *
97.0   16 *
97.5   21 *
98.0   38 *
98.5   35 *
99.0   92 **
99.5 5300 *************************************************************
-> best cost for all runs: $102.60
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.495 & 0.96
->     fp 3; fn 14; unsure ham 40; unsure spam 253
->     fp rate 0.0187%; fn rate 0.241%; unsure rate 1.34%

==================
Dividing the log-products and n by 2:
-> <stat> Ham scores for all runs: 16000 items; mean 0.76; sdev 5.07
-> <stat> min 0; median 1.19013e-05; max 99.9998
-> <stat> fivepctlo 0; fivepcthi 1.54439
* = 242 items
  0.0 14736 *************************************************************
  0.5   316 **
  1.0   134 *
  1.5   103 *
  2.0    74 *
  2.5    60 *
  3.0    37 *
  3.5    35 *
  4.0    34 *
-> <stat> Spam scores for all runs: 5800 items; mean 98.71; sdev 5.97
-> <stat> min 0.000221093; median 100; max 100
-> <stat> fivepctlo 92.9253; fivepcthi 100
* = 83 items
95.5   27 *
96.0   21 *
96.5   35 *
97.0   38 *
97.5   40 *
98.0   59 *
98.5   82 *
99.0  122 **
99.5 5005 *************************************************************
-> best cost for all runs: $104.40
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.49 & 0.92
->     fp 3; fn 14; unsure ham 43; unsure spam 259
->     fp rate 0.0187%; fn rate 0.241%; unsure rate 1.39%

=============================================
Dividing the log-products and n by 4:
-> <stat> Ham scores for all runs: 16000 items; mean 1.32; sdev 5.49
-> <stat> min 0; median 0.0140483; max 99.9378
-> <stat> fivepctlo 1.11022e-14; fivepcthi 6.09162
* = 206 items
  0.0 12557 *************************************************************
  0.5   880 *****
  1.0   511 ***
  1.5   298 **
  2.0   223 **
  2.5   176 *
  3.0   135 *
  3.5   113 *
  4.0    91 *
-> <stat> min 0.0626454; median 99.9953; max 100
-> <stat> fivepctlo 87.8576; fivepcthi 100
* = 71 items
95.5   38 *
96.0   54 *
96.5   55 *
97.0   59 *
97.5   70 *
98.0  150 ***
98.5  142 **
99.0  280 ****
99.5 4331 *************************************************************
-> best cost for all runs: $108.20
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 2 cutoff pairs
-> smallest ham & spam cutoffs 0.48 & 0.855
->     fp 4; fn 13; unsure ham 46; unsure spam 230
->     fp rate 0.025%; fn rate 0.224%; unsure rate 1.27%
-> largest ham & spam cutoffs 0.485 & 0.855
->     fp 4; fn 14; unsure ham 42; unsure spam 229
->     fp rate 0.025%; fn rate 0.241%; unsure rate 1.24%


As I expected, this broadens the extremes significantly at only very 
little cost. Statistically, dividing the log-products and n by a 
constant downweights every clue by the same amount, which compensates 
for a "standard" degree of correlation between clues. It may be 
functionally equivalent to raising the value of s.

This is the /4 code for reference:

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.38
diff -u -r1.38 classifier.py
--- classifier.py	14 Oct 2002 02:20:35 -0000	1.38
+++ classifier.py	17 Oct 2002 15:24:55 -0000
@@ -516,7 +516,10 @@
         S = ln(S) + Sexp * LN2
         H = ln(H) + Hexp * LN2
 
-        n = len(clues)
+        S = S/4.0
+        H = H/4.0
+
+        n = len(clues)//4
         if n:
             S = 1.0 - chi2Q(-2.0 * S, 2*n)
             H = 1.0 - chi2Q(-2.0 * H, 2*n)

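For anyone who wants to play with the scale factor without patching, the scheme can be sketched as a self-contained function. This is a sketch only: summing logs directly stands in for the Sexp/Hexp exponent bookkeeping in classifier.py, and chi2Q is reimplemented in the even-dof series form; the function name and `scale` parameter are mine, not the project's.

```python
import math

def chi2Q(x2, v):
    # P(chi-squared with v dof exceeds x2); series form, v must be even
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs, scale=4):
    """Chi-squared combining of per-clue spam probabilities, with the
    log-products and the clue count both divided by `scale`.
    scale=1 is the standard scheme; scale=4 matches the patch above."""
    assert len(probs) >= scale, "need at least `scale` clues"
    S = sum(math.log(1.0 - p) for p in probs) / scale   # ham evidence
    H = sum(math.log(p) for p in probs) / scale         # spam evidence
    n = len(probs) // scale
    S = 1.0 - chi2Q(-2.0 * S, 2 * n)
    H = 1.0 - chi2Q(-2.0 * H, 2 * n)
    return (S - H + 1.0) / 2.0   # final score in [0, 1]
```

Note that a perfectly symmetric clue set (all probabilities 0.5) scores exactly 0.5 for any scale; the scale only changes how fast one-sided evidence saturates toward 0 or 1.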
Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/