Third result ... RE: [Spambayes] First result from Gary Robinson's ideas

Thu, 19 Sep 2002 15:22:11 -0400

[Anthony Baxter]
> >     [Classifier]
> >     use_robinson_probability: True
> >
> >     [TestDriver]
> >     spam_cutoff: 0.50

> false positive percentages
>     0.334  0.445  lost   +33.23%
>     0.278  0.334  lost   +20.14%
>     0.389  0.445  lost   +14.40%
>     0.556  0.556  tied
>     0.278  0.334  lost   +20.14%
>
> won   0 times
> tied  1 times
> lost  4 times
>
> total unique fp went from 33 to 38 lost   +15.15%
> mean fp % went from 0.366932315011 to 0.42251875192 lost   +15.15%
>
> false negative percentages
>     0.582  0.452  won    -22.34%
>     0.775  0.646  won    -16.65%
>     0.710  0.710  tied
>     0.581  0.581  tied
>     0.906  0.647  won    -28.59%
>
> won   3 times
> tied  2 times
> lost  0 times
>
> total unique fn went from 55 to 47 won    -14.55%
> mean fn % went from 0.710811726238 to 0.60736899408 won    -14.55%

Note that for both rates, the "after" run is more eager to call things spam,
and that I had no particular reason to set spam_cutoff to 0.500 exactly
(something "near" 0.5 was clearly best for my first test).
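
As a rough illustration of why the cutoff matters, here is a minimal sketch of
how a single spam_cutoff value turns message scores into false positive and
false negative counts.  The function name, the scores, and the "score >= cutoff
means spam" convention are all assumptions made up for this example; they are
not taken from the test driver or from Anthony's data.

```python
def count_errors(ham_scores, spam_scores, spam_cutoff):
    """Return (false_positives, false_negatives) at the given cutoff.

    Assumed convention: a message is called spam when its score is at or
    above spam_cutoff.  The real test driver may compare differently.
    """
    # Ham called spam: a false positive.
    fp = sum(1 for score in ham_scores if score >= spam_cutoff)
    # Spam called ham: a false negative.
    fn = sum(1 for score in spam_scores if score < spam_cutoff)
    return fp, fn


# Hypothetical scores, chosen only to show the trade-off near 0.5.
ham_scores = [0.02, 0.10, 0.48, 0.51, 0.53]
spam_scores = [0.49, 0.51, 0.52, 0.90, 0.99]

for cutoff in (0.500, 0.525):
    fp, fn = count_errors(ham_scores, spam_scores, cutoff)
    print(f"spam_cutoff={cutoff:.3f}: fp={fp} fn={fn}")
```

With these made-up scores, raising the cutoff removes one false positive and
adds two false negatives; messages scoring in the narrow band between the two
cutoffs are the only ones whose classification changes.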

One dead simple way to counteract a tendency to call things spam is to raise
the spam_cutoff threshold.  Your second-run score histograms show that
boosting spam_cutoff from 0.500 to 0.525 would have saved 9 false positives
and added 10 false negatives (using your second-run counts as the baseline).
Then (using your first-run counts as the baseline) your total unique fp would
have fallen from 33 to 29, and your total unique fn would have risen from 55
to 57.  That would be consistent with previous "no real difference in overall
results, but there's an exploitable middle ground" reports.  OTOH, maybe
0.525016932 would be better still <wink>.
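
The arithmetic in the paragraph above can be checked with a short sketch.  The
counts are the ones quoted in this message; the variable names and the
pct_change helper are hypothetical, introduced only for illustration.

```python
def pct_change(before, after):
    """Relative change from before to after, as a percentage."""
    return 100.0 * (after - before) / before


# Total unique error counts quoted from the report.
first_run = {"fp": 33, "fn": 55}    # baseline run
second_run = {"fp": 38, "fn": 47}   # use_robinson_probability, spam_cutoff 0.50

# Effect of moving spam_cutoff from 0.500 to 0.525, read off the second-run
# score histograms as described in the text.
fp_saved = 9
fn_added = 10

fp_at_0525 = second_run["fp"] - fp_saved   # 38 - 9 = 29
fn_at_0525 = second_run["fn"] + fn_added   # 47 + 10 = 57

for kind, adjusted in (("fp", fp_at_0525), ("fn", fn_at_0525)):
    base = first_run[kind]
    print(f"total unique {kind}: {base} -> {adjusted} "
          f"({pct_change(base, adjusted):+.2f}%)")
```

The same helper reproduces the summary lines of the quoted report: 33 to 38 is
a change of about +15.15%, and 55 to 47 is about -14.55%.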

Thanks for the report, Anthony!