[Spambayes] z-combining

T. Alexander Popiel popiel@wolfskeep.com
Mon, 14 Oct 2002 14:53:06 -0700


Well, I did a z-combining run.  @whee.  It replaces my
all-defaults run as cv1.  chi-square remains as cv2.

From results.txt:

"""
ham mean                     ham sdev
   0.50    0.50   +0.00%        7.05    7.04   -0.14%
   0.26    0.27   +3.85%        3.65    3.71   +1.64%
   0.02    0.04 +100.00%        0.29    0.41  +41.38%
   0.49    0.41  -16.33%        5.44    4.13  -24.08%
   0.38    0.36   -5.26%        5.27    4.84   -8.16%
   1.03    1.01   -1.94%        9.88    9.42   -4.66%
   0.51    0.51   +0.00%        5.56    5.47   -1.62%
   0.09    0.16  +77.78%        1.26    1.94  +53.97%
   0.97    0.95   -2.06%        9.66    9.40   -2.69%
   0.12    0.14  +16.67%        1.73    1.88   +8.67%

ham mean and sdev for all runs
   0.44    0.44   +0.00%        5.90    5.65   -4.24%

spam mean                    spam sdev
  98.68   98.42   -0.26%       10.66   10.85   +1.78%
  99.31   99.26   -0.05%        5.62    5.56   -1.07%
  97.68   97.82   +0.14%       13.94   12.18  -12.63%
  98.84   98.85   +0.01%        9.00    8.90   -1.11%
  98.54   98.55   +0.01%       11.71    9.65  -17.59%
  97.99   98.31   +0.33%       13.48   11.21  -16.84%
  96.88   97.25   +0.38%       15.83   13.12  -17.12%
  99.34   98.98   -0.36%        4.95    6.15  +24.24%
  98.07   98.26   +0.19%       11.74   10.37  -11.67%
  99.65   99.01   -0.64%        3.04    5.46  +79.61%

spam mean and sdev for all runs
  98.50   98.47   -0.03%       10.81    9.72  -10.08%

ham/spam mean difference: 98.06 98.03 -0.03
"""

z-combining (cv1, left column) loses vs. chi-square (cv2, right
column) there, with looser sdevs: 5.90 vs. 5.65 for ham and
10.81 vs. 9.72 for spam.

Next, we have the best-cost computations for z-combining:

"""
-> best cost $54.20
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 6 cutoff pairs
-> smallest ham & spam cutoffs 0.01 & 0.985
->     fp 3; fn 13; unsure ham 12; unsure spam 44
->     fp rate 0.15%; fn rate 0.65%; unsure rate 1.4%
-> largest ham & spam cutoffs 0.035 & 0.985
->     fp 3; fn 13; unsure ham 12; unsure spam 44
->     fp rate 0.15%; fn rate 0.65%; unsure rate 1.4%
"""

Compare with the one from chi-square:

"""
-> best cost $48.00
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 3 cutoff pairs
-> smallest ham & spam cutoffs 0.03 & 0.89
->     fp 3; fn 6; unsure ham 12; unsure spam 48
->     fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
-> largest ham & spam cutoffs 0.03 & 0.9
->     fp 3; fn 6; unsure ham 12; unsure spam 48
->     fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
"""

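Looks like z-combining has real granularity problems near
the top end: its best spam cutoff is pinned at 0.985, vs.
0.89 to 0.9 for chi-square, and it still costs more ($54.20
vs. $48.00).  Trash it.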

- Alex