[Spambayes] Current histograms

Tim Peters tim.one@comcast.net
Mon, 09 Sep 2002 23:18:25 -0400


We've not only reduced the f-p and f-n rates in my test runs, we've also
made the score distributions substantially sharper.  This is bad news for
Greg, because the non-existent "middle ground" is becoming even less
existent <wink>:

Ham distribution for all runs:
* = 1333 items
  0.00 79975 ************************************************************
  2.50     1 *
  5.00     0
  7.50     0
 10.00     2 *
 12.50     1 *
 15.00     0
 17.50     0
 20.00     0
 22.50     1 *
 25.00     0
 27.50     0
 30.00     0
 32.50     0
 35.00     0
 37.50     1 *
 40.00     0
 42.50     0
 45.00     0
 47.50     0
 50.00     0
 52.50     0
 55.00     0
 57.50     0
 60.00     1 *
 62.50     0
 65.00     1 *
 67.50     0
 70.00     0
 72.50     0
 75.00     0
 77.50     0
 80.00     0
 82.50     0
 85.00     0
 87.50     0
 90.00     0
 92.50     0
 95.00     0
 97.50    17 *

Spam distribution for all runs:
* = 914 items
  0.00   118 *
  2.50     7 *
  5.00     0
  7.50     2 *
 10.00     1 *
 12.50     1 *
 15.00     3 *
 17.50     1 *
 20.00     1 *
 22.50     1 *
 25.00     0
 27.50     0
 30.00     4 *
 32.50     3 *
 35.00     4 *
 37.50     2 *
 40.00     0
 42.50     1 *
 45.00     1 *
 47.50     0
 50.00     2 *
 52.50     3 *
 55.00     1 *
 57.50     2 *
 60.00     0
 62.50     1 *
 65.00     1 *
 67.50    10 *
 70.00     2 *
 72.50     1 *
 75.00     2 *
 77.50     1 *
 80.00     0
 82.50     0
 85.00     1 *
 87.50     4 *
 90.00     2 *
 92.50     5 *
 95.00    14 *
 97.50 54806 ************************************************************

As usual for me, this is an aggregate of 20 runs, each both training and
predicting on 4000 c.l.py ham + ~2750 BruceG spam.

Only 25 ham scores out of 80,000 are above 0.025 now (and, yes, the
"Nigerian scam"-quoting msg is still counted as ham -- I haven't taken
anything out of the ham corpus since remving the "If AOL were a car" spam),
the f-p rate wouldn't have changed at all if the spamprob cutoff were
dropped from 0.90 to 0.675, dropping the cutoff to 0.40 would have added
only 2 false positives, and dropping it to 0.15 would have added only
another 2 more!

It's spooky.