[Spambayes] There Can Be Only One

Tim Peters tim.one@comcast.net
Wed, 25 Sep 2002 01:25:51 -0400


[Guido]
> ...
> The 60.0 bins in the histogram have 2 hams and 7 spams, so moving the
> cutoff to 0.625 would have made it a tie for fps and a loss by 1 for
> fns.

This is too painful.  There's a new option now, enabled by default:

"""
[TestDriver]
# When compute_best_cutoffs_from_histograms is enabled, after the
# display of a ham+spam histogram pair, a listing is given of all
# the cutoff scores (coinciding with a histogram boundary) that
# minimize the total number of misclassified messages (false
# positives + false negatives).
compute_best_cutoffs_from_histograms: True
"""

Other definitions of "best" are certainly possible.  You may wish to
increase nbuckets too (to increase the resolution of the automated
analysis).

The output looks like this(*):

-> best cutoff for this pair: 0.425
->     with 1 fp + 0 fn = 1 mistakes
->     matched at 0.45 (0 fp + 1 fn)
->     matched at 0.475 (0 fp + 1 fn)
->     matched at 0.5 (0 fp + 1 fn)

You can be sure then that the total errors are higher at all other histogram
boundary points.
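
For anyone who wants to see the idea without reading the test driver, here is
a minimal sketch of that kind of scan.  It is not the driver's code: the
function name best_cutoffs, the plain-list histograms, and the toy bucket
counts are all made up for illustration, and it assumes a message scoring at
or above the cutoff is called spam.  The toy data is arranged so that it
reproduces the shape of the sample output above.

```python
def best_cutoffs(ham_buckets, spam_buckets):
    """Return (fewest_errors, [(cutoff, fp, fn), ...]) over bucket boundaries.

    Bucket i is assumed to count messages scoring in [i/n, (i+1)/n),
    where n is the number of buckets (hypothetical layout).
    """
    n = len(ham_buckets)
    assert n == len(spam_buckets)
    fp = sum(ham_buckets)  # cutoff 0: everything is called spam
    fn = 0                 # ... so no spam is missed yet
    results = [(0.0, fp, fn)]
    for i in range(n):
        # Moving the cutoff past bucket i redeems that bucket's hams and
        # turns that bucket's spams into false negatives.
        fp -= ham_buckets[i]
        fn += spam_buckets[i]
        results.append(((i + 1) / n, fp, fn))
    fewest = min(p + q for _, p, q in results)
    return fewest, [r for r in results if r[1] + r[2] == fewest]


# Toy histograms with 40 buckets (bucket width 0.025); invented data.
nbuckets = 40
ham = [0] * nbuckets
spam = [0] * nbuckets
ham[2] = 50    # the bulk of the ham scores low
ham[16] = 1    # a ham scoring in [0.400, 0.425)
ham[17] = 1    # a ham scoring in [0.425, 0.450)
spam[17] = 1   # a spam scoring in [0.425, 0.450)
spam[20] = 1   # a spam scoring in [0.500, 0.525)
spam[38] = 50  # the bulk of the spam scores high

fewest, winners = best_cutoffs(ham, spam)
for cutoff, fp, fn in winners:
    print(f"cutoff {cutoff}: {fp} fp + {fn} fn = {fewest} mistakes")
```

With these invented counts the scan reports 0.425, 0.45, 0.475 and 0.5 as tied
at 1 mistake, the same four boundaries as in the listing.  Increasing the
number of buckets just makes the boundaries finer.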

> ...
> I guess spams for magic diets are common.

Exceedingly!  Remember the time one of my persistent false positives was a
long message from Alex Martelli, discussing the quality of water needed in
the preparation of pasta.  He got nailed more by the "water retention" parts
of diet spams than by the "water sports" parts of porn spams.  None of these
have much to do with Python, either <wink>.


(*) So nobody else has to bother, here's a comparison of my f(w) run against
the same thing but with robinson_probability_x set to 0.1 (the "unknown word"
prob; I set it to a silly value just to make sure it mattered.  If you
consider unknown words to be strong ham indicators, then there should be a
push toward calling all things ham, so the f-p rate should go down and the
f-n rate up; and so it is):

false positive percentages
    0.500  0.000  won   -100.00%
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   1 times
tied  9 times
lost  0 times

total unique fp went from 1 to 0 won   -100.00%
mean fp % went from 0.05 to 0.0 won   -100.00%

false negative percentages
    0.000  1.500  lost  +(was 0)
    0.000  2.000  lost  +(was 0)
    0.000  3.000  lost  +(was 0)
    0.000  1.000  lost  +(was 0)
    0.000  1.500  lost  +(was 0)
    0.000  1.500  lost  +(was 0)
    0.000  1.500  lost  +(was 0)
    0.000  2.000  lost  +(was 0)
    0.000  1.000  lost  +(was 0)
    0.000  2.000  lost  +(was 0)

won   0 times
tied  0 times
lost 10 times

total unique fn went from 0 to 34 lost  +(was 0)
mean fn % went from 0.0 to 1.7 lost  +(was 0)

ham mean                     ham sdev
  33.01   28.66  -13.18%        6.26    5.49  -12.30%
  32.19   27.90  -13.33%        5.38    5.02   -6.69%
  32.99   28.62  -13.25%        5.60    5.16   -7.86%
  33.46   29.14  -12.91%        5.77    5.30   -8.15%
  33.16   28.70  -13.45%        5.56    5.11   -8.09%
  32.81   28.69  -12.56%        5.72    5.20   -9.09%
  33.38   29.15  -12.67%        5.76    5.30   -7.99%
  32.55   28.24  -13.24%        5.70    5.25   -7.89%
  33.11   28.73  -13.23%        5.52    5.08   -7.97%
  34.21   29.74  -13.07%        5.84    5.39   -7.71%

ham mean and sdev for all runs
  33.09   28.76  -13.09%        5.73    5.25   -8.38%

spam mean                    spam sdev
  82.95   71.96  -13.25%        6.82    7.26   +6.45%
  82.17   71.80  -12.62%        6.34    7.63  +20.35%
  82.06   71.37  -13.03%        6.14    7.34  +19.54%
  82.39   71.95  -12.67%        5.93    6.72  +13.32%
  82.53   72.43  -12.24%        7.00    7.56   +8.00%
  82.76   71.97  -13.04%        6.56    7.30  +11.28%
  82.06   71.55  -12.81%        5.73    6.90  +20.42%
  82.26   72.49  -11.88%        5.97    6.96  +16.58%
  82.65   72.90  -11.80%        6.71    7.54  +12.37%
  83.43   73.03  -12.47%        6.37    7.95  +24.80%

spam mean and sdev for all runs
  82.53   72.14  -12.59%        6.37    7.34  +15.23%

ham/spam mean difference: 49.44 43.38 -6.06
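
In case the columns aren't obvious: each row gives the before value, the after
value, and the relative change.  A minimal sketch of that arithmetic follows.
The function names percent_change and verdict are made up here and are not
taken from the comparison script; the numbers fed in are copied from the
listing above.

```python
def percent_change(before, after):
    """Relative change from before to after, formatted like the listing."""
    if before == after:
        return ""
    if before == 0:
        return "+(was 0)"  # a percentage is undefined when starting from 0
    return f"{(after - before) / before * 100:+.2f}%"


def verdict(before, after):
    """For error rates lower is better: 'won', 'lost' or 'tied'."""
    if after < before:
        return "won"
    if after > before:
        return "lost"
    return "tied"


# First false positive row, and first false negative row.
print(verdict(0.500, 0.000), percent_change(0.500, 0.000))
print(verdict(0.000, 1.500), percent_change(0.000, 1.500))

# Ham mean for all runs.
print(percent_change(33.09, 28.76))

# ham/spam mean difference, before and after, and the change between them.
before_gap = round(82.53 - 33.09, 2)
after_gap = round(72.14 - 28.76, 2)
print(before_gap, after_gap, round(after_gap - before_gap, 2))
```

The means for both populations dropped by about 13%, but because the spam mean
started so much higher, the absolute gap between the two shrank.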

Now if I *were* to keep such a silly change, the last histogram analysis at
least tells me how to minimize the damage:

-> best cutoff for all runs: 0.475
->     with 2 fp + 6 fn = 8 mistakes

Since the change was in the direction of calling more things ham, this makes
good sense:  reducing spam_cutoff works in the direction of calling more
things spam.
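
To make both directions concrete, here is a small sketch under stated
assumptions.  It uses Robinson's adjustment f(w) = (s*x + n*p(w)) / (s + n),
where x is the unknown-word probability and n is how many times the word has
been seen.  The baseline x = 0.5, the strength s = 1.0, the toy scores, and
the helper is_spam (with its strict comparison) are all assumptions picked for
illustration, not values or code taken from the runs above.

```python
def f_w(p_w, n, x, s=1.0):
    """Robinson's adjusted word probability: (s*x + n*p(w)) / (s + n).

    p_w: raw spam probability estimated for the word
    n:   number of messages the word was seen in
    x:   probability assumed for a never-seen word
    s:   strength given to that assumption (1.0 is an illustrative choice)
    """
    return (s * x + n * p_w) / (s + n)


def is_spam(score, spam_cutoff):
    """Hypothetical decision rule: scores above the cutoff are called spam."""
    return score > spam_cutoff


# A never-seen word (n = 0) gets exactly x, whatever p_w is.
print(f_w(0.0, 0, x=0.5), f_w(0.0, 0, x=0.1))

# A word seen once, only in spam (p_w = 1.0), is also dragged toward ham.
print(f_w(1.0, 1, x=0.5), f_w(1.0, 1, x=0.1))

# Invented message scores: lowering the cutoff calls more of them spam.
scores = [0.30, 0.48, 0.49, 0.80]
for cutoff in (0.5, 0.475):
    called_spam = sum(is_spam(sc, cutoff) for sc in scores)
    print(f"cutoff {cutoff}: {called_spam} of {len(scores)} called spam")
```

Lowering x pulls rarely seen words toward the ham end, and lowering
spam_cutoff pulls the decision line down after them.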