[Spambayes] There Can Be Only One
Tim Peters
tim.one@comcast.net
Wed, 25 Sep 2002 01:25:51 -0400
[Guido]
> ...
> The 60.0 bins in the histogram have 2 hams and 7 spams, so moving the
> cutoff to 0.625 would have made it a tie for fps and a loss by 1 for
> fns.
This is too painful. There's a new option now, enabled by default:
"""
[TestDriver]
# When compute_best_cutoffs_from_histograms is enabled, after the
# display of a ham+spam histogram pair, a listing is given of all
# the cutoff scores (coinciding with a histogram boundary) that
# minimize the total number of misclassified messages (false
# positives + false negatives).
compute_best_cutoffs_from_histograms: True
"""
Other definitions of "best" are certainly possible. You may wish to
increase nbuckets too (to increase the resolution of the automated
analysis).
The output looks like this(*):
-> best cutoff for this pair: 0.425
-> with 1 fp + 0 fn = 1 mistakes
-> matched at 0.45 (0 fp + 1 fn)
-> matched at 0.475 (0 fp + 1 fn)
-> matched at 0.5 (0 fp + 1 fn)
You can be sure then that the total errors are higher at all other histogram
boundary points.
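For anyone who wants to see the idea in miniature, here is a rough sketch of how such a listing could be computed from a ham+spam histogram pair. It is not the TestDriver code: the name best_cutoffs, its arguments, and the rule that a score at or above the cutoff counts as spam are all assumptions made for illustration.

```python
def best_cutoffs(ham_buckets, spam_buckets):
    """Return (min_errors, [(cutoff, fp, fn), ...]) over bucket boundaries.

    Hypothetical helper, not the spambayes implementation.
    ham_buckets[i] and spam_buckets[i] count the messages whose score
    falls in [i/n, (i+1)/n), where n is the number of buckets.
    Assumption: a message is called spam when its score is >= cutoff.
    """
    n = len(ham_buckets)
    if n != len(spam_buckets):
        raise ValueError("histograms must have the same number of buckets")

    fp = sum(ham_buckets)  # cutoff 0.0: every ham is called spam
    fn = 0                 # ... and no spam is called ham
    results = [(0.0, fp, fn)]
    for i in range(n):
        # Moving the cutoff from i/n to (i+1)/n puts bucket i below it:
        # its hams stop being false positives, its spams become
        # false negatives.
        fp -= ham_buckets[i]
        fn += spam_buckets[i]
        results.append(((i + 1) / n, fp, fn))

    best = min(f + g for _, f, g in results)
    return best, [r for r in results if r[1] + r[2] == best]


# Tiny made-up histograms with 4 buckets each.
print(best_cutoffs([5, 3, 1, 0], [0, 1, 2, 6]))  # -> (2, [(0.5, 1, 1)])
```

Every boundary tied for the minimum is returned, which corresponds to the "matched at" lines in the output above. Using more buckets gives more candidate boundaries, and so a finer-grained answer.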
> ...
> I guess spams for magic diets are common.
Exceedingly! Remember the time one of my persistent false positives was a
long message from Alex Martelli, discussing the quality of water needed in
the preparation of pasta. He got nailed more by the "water retention" parts
of diet spams than by the "water sports" parts of porn spams. None of these
have much to do with Python, either <wink>.
(*) So nobody else has to bother, here's my f(w) run compared to the
same thing but with robinson_probability_x set to 0.1 (the "unknown word"
prob -- I set it to a silly value just to make sure it mattered -- if you
consider unknown words to be strong ham indicators, then there should be a
push toward calling all things ham, so the f-p rate should go down and the
f-n rate up; and so it is):
false positive percentages
    0.500  0.000  won   -100.00%
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
won   1 times
tied  9 times
lost  0 times
total unique fp went from 1 to 0 won   -100.00%
mean fp % went from 0.05 to 0.0 won   -100.00%
false negative percentages
    0.000  1.500  lost  +(was 0)
    0.000  2.000  lost  +(was 0)
    0.000  3.000  lost  +(was 0)
    0.000  1.000  lost  +(was 0)
    0.000  1.500  lost  +(was 0)
    0.000  1.500  lost  +(was 0)
    0.000  1.500  lost  +(was 0)
    0.000  2.000  lost  +(was 0)
    0.000  1.000  lost  +(was 0)
    0.000  2.000  lost  +(was 0)
won   0 times
tied  0 times
lost 10 times
total unique fn went from 0 to 34 lost  +(was 0)
mean fn % went from 0.0 to 1.7 lost  +(was 0)
ham mean                       ham sdev
    33.01   28.66  -13.18%      6.26    5.49  -12.30%
    32.19   27.90  -13.33%      5.38    5.02   -6.69%
    32.99   28.62  -13.25%      5.60    5.16   -7.86%
    33.46   29.14  -12.91%      5.77    5.30   -8.15%
    33.16   28.70  -13.45%      5.56    5.11   -8.09%
    32.81   28.69  -12.56%      5.72    5.20   -9.09%
    33.38   29.15  -12.67%      5.76    5.30   -7.99%
    32.55   28.24  -13.24%      5.70    5.25   -7.89%
    33.11   28.73  -13.23%      5.52    5.08   -7.97%
    34.21   29.74  -13.07%      5.84    5.39   -7.71%
ham mean and sdev for all runs
    33.09   28.76  -13.09%      5.73    5.25   -8.38%
spam mean                      spam sdev
    82.95   71.96  -13.25%      6.82    7.26   +6.45%
    82.17   71.80  -12.62%      6.34    7.63  +20.35%
    82.06   71.37  -13.03%      6.14    7.34  +19.54%
    82.39   71.95  -12.67%      5.93    6.72  +13.32%
    82.53   72.43  -12.24%      7.00    7.56   +8.00%
    82.76   71.97  -13.04%      6.56    7.30  +11.28%
    82.06   71.55  -12.81%      5.73    6.90  +20.42%
    82.26   72.49  -11.88%      5.97    6.96  +16.58%
    82.65   72.90  -11.80%      6.71    7.54  +12.37%
    83.43   73.03  -12.47%      6.37    7.95  +24.80%
spam mean and sdev for all runs
    82.53   72.14  -12.59%      6.37    7.34  +15.23%
ham/spam mean difference: 49.44 43.38 -6.06
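Why both means drop can be seen from Gary Robinson's adjustment formula, f(w) = (s*x + n*p(w)) / (s + n), where x is the unknown-word probability. The sketch below is only an illustration: the function name robinson_f, the argument names, and the value s=1.0 are hypothetical, the neutral 0.5 baseline is an assumption the comparison above does not state, and the spambayes code of the time may differ in detail.

```python
def robinson_f(p, n, s, x):
    """Robinson's adjusted word probability (illustrative sketch).

    p: raw spam-probability estimate for the word
    n: number of training messages the word was seen in
    s: strength given to the prior (hypothetical value in the demo below)
    x: probability assumed for a word never seen before
    """
    return (s * x + n * p) / (s + n)


# A never-seen word (n == 0) scores exactly x, whatever p is.
print(robinson_f(0.9, 0, s=1.0, x=0.5))  # assumed neutral baseline
print(robinson_f(0.9, 0, s=1.0, x=0.1))  # the "silly" setting: strong ham clue

# A rarely seen word is pulled toward x; a common one barely moves.
for n in (1, 10, 1000):
    print(n, robinson_f(0.9, n, s=1.0, x=0.5), robinson_f(0.9, n, s=1.0, x=0.1))
```

Lowering x pulls every word's score down, most strongly for rare words, which fits with the ham and spam means both falling in the tables above.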
Now if I *were* to keep such a silly change, the last histogram analysis at
least tells me how to minimize the damage:
-> best cutoff for all runs: 0.475
-> with 2 fp + 6 fn = 8 mistakes
Since the change was in the direction of calling more things ham, this makes
good sense: reducing spam_cutoff works in the direction of calling more
things spam.