[Spambayes] Move closer to Gary's ideal

Tim Peters tim.one@comcast.net
Fri, 20 Sep 2002 23:29:19 -0400


I've checked some new code in for the adventurous.  To try it, you can do

"""
[Classifier]
use_robinson_probability: True
use_robinson_combining: True
max_discriminators: 1500
[TestDriver]
spam_cutoff: 0.50
"""

"1500" was my lazy way of spelling infinity; for now, the code uses
math.frexp() to simulate unbounded dynamic float range instead of bothering
with logarithms; this also means the database entries are exactly the same
as they were before.  I left max_discriminators working because I suspect
we're going to want it again.
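If you're curious how the frexp() part works, here's the idea in
miniature -- a sketch, not the checked-in code (the names are mine):
carry the running product as a (mantissa, exponent) pair, and pull the
binary exponent out after each multiply, so the mantissa never
underflows no matter how many tiny probabilities get folded in:

"""
import math

def underflow_free_product(probs):
    # Returns (mant, exp) with the product equal to mant * 2**exp.
    # frexp keeps mant in [0.5, 1.0), so multiplying in thousands of
    # near-zero probabilities can't underflow to 0.0.
    mant, exp = 1.0, 0
    for p in probs:
        mant *= p
        mant, e = math.frexp(mant)
        exp += e
    return mant, exp

# 1500 words at prob 1e-5 each:  a plain float product underflows to
# 0.0, but this survives (mant about 0.76, exp about -24914).
print(underflow_free_product([1e-5] * 1500))
"""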

These options have no effect when the above is enabled:

"""
[Classifier]
hambias
spambias
min_spamprob
max_spamprob
"""

I hate all of those, so good riddance if they go <wink>.

Other options you may want to play with, though I don't recommend it unless
you've read the source material and think you know what you're doing:

"""
[Classifier]
# This one has no effect for now (it's easy to do, I just haven't gotten
# to it yet).
use_robinson_ranking: False

# The "a" parameter in Gary's prob adjustment.
robinson_probability_a: 1.0

# Likewise the "x" parameter -- it's like our current UNKNOWN_SPAMPROB.
robinson_probability_x: 0.5
"""

I'm still recovering from my corpus screwup and don't have a lot to say
about this yet.  Overall it seems to be doing as well as the all-default
scheme (our highly tuned and heavily fiddled Graham scheme)!  If it works
better than that, I won't be able to tell from my data (the all-default
scheme was working "too well" for me to demonstrate an improvement if one
were made).  I did notice it nail some difficult false negatives I don't
think the minprob/maxprob-hobbled Graham scheme would ever be able to nail.
So all signs are good so far, except maybe one:

There's one surprising/maybe-disturbing thing I've seen on all my little
random-subset runs (which are all I've run so far, interleaved with
re-cleaning my corpus):  there's not only "a middle ground" now, it's
essentially ALL "middle ground"!  Scores that aren't due to my corpus
pollution are virtually all within 20 points of 50.  Here's a typical
histogram pair from a 10-fold c-v run on a random subset of 1000 ham and
1000 spam; the 6 lowest-scoring oddballs in the spam distro were in fact
bogus false negatives due to my corpus screwup (so picture those dots as
belonging in the ham histogram instead):

Ham distribution for all runs:
* = 6 items
  0.00   0
  2.50   0
  5.00   0
  7.50   0
 10.00   0
 12.50   0
 15.00   0
 17.50   0
 20.00   0
 22.50   0
 25.00   0
 27.50   6 *
 30.00  35 ******
 32.50  98 *****************
 35.00 221 *************************************
 37.50 307 ****************************************************
 40.00 229 ***************************************
 42.50  78 *************
 45.00  20 ****
 47.50   5 *
 50.00   1 *
 52.50   0
 55.00   0
 57.50   0
 60.00   0
 62.50   0
 65.00   0
 67.50   0
 70.00   0
 72.50   0
 75.00   0
 77.50   0
 80.00   0
 82.50   0
 85.00   0
 87.50   0
 90.00   0
 92.50   0
 95.00   0
 97.50   0

Spam distribution for all runs:
* = 6 items
  0.00   0
  2.50   0
  5.00   0
  7.50   0
 10.00   0
 12.50   0
 15.00   0
 17.50   0
 20.00   0
 22.50   0
 25.00   0
 27.50   0
 30.00   0
 32.50   1 *
 35.00   1 *
 37.50   1 *
 40.00   1 *
 42.50   1 *
 45.00   1 *
 47.50   1 *
 50.00  28 *****
 52.50  64 ***********
 55.00 184 *******************************
 57.50 352 ***********************************************************
 60.00 295 **************************************************
 62.50  69 ************
 65.00   1 *
 67.50   0
 70.00   0
 72.50   0
 75.00   0
 77.50   0
 80.00   0
 82.50   0
 85.00   0
 87.50   0
 90.00   0
 92.50   0
 95.00   0
 97.50   0

I have to do other things now, but if anyone wants to play with what I
*would* do if I could <wink>, play with max_discriminators and see whether
reducing that helps spread this out.  A suspicion is that folding in endless
quantities of garbage words (spamprob so close to 0.5 that they're not
really clues at all) may be dragging everything toward 0.5 without giving a
real benefit.
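
For anyone diving in, clue selection amounts to something like this (a
sketch of the idea, not the checked-in code):  rank words by distance
from the useless midpoint, and keep only the strongest:

"""
def strongest_clues(probs, max_discriminators):
    # Words with spamprob near 0.5 carry no information; keep only
    # the max_discriminators words farthest from the midpoint.
    ranked = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)
    return ranked[:max_discriminators]
"""

Cutting max_discriminators way down from 1500 throws away exactly the
suspected garbage words, so it's the cheapest test of the theory.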

The math in Graham's combining scheme is such that a prob 0.5 word has no
effect whatsoever on the outcome.  The math in Gary's combining scheme
doesn't appear to have the same property:  I believe adding a .5 prob word
there moves the outcome closer to neutrality.
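
To make that concrete, here are both combiners in stripped-down form
(no frexp trickery, no max_discriminators -- the formulas as I
understand them), plus a two-clue demonstration:

"""
def graham_combine(probs):
    # P = prod(p) / (prod(p) + prod(1-p)).  A 0.5 clue multiplies
    # both products by 0.5, so it cancels out of the ratio entirely.
    s = h = 1.0
    for p in probs:
        s *= p
        h *= 1.0 - p
    return s / (s + h)

def gary_combine(probs):
    # P and Q are 1 minus the geometric means of the (1-p)'s and the
    # p's; the score is Gary's S indicator rescaled to [0., 1.].
    n = len(probs)
    prod_h = prod_s = 1.0
    for p in probs:
        prod_h *= 1.0 - p
        prod_s *= p
    P = 1.0 - prod_h ** (1.0 / n)
    Q = 1.0 - prod_s ** (1.0 / n)
    S = (P - Q) / (P + Q)
    return (1.0 + S) / 2.0

clues = [0.99, 0.99]
print(graham_combine(clues), graham_combine(clues + [0.5]))
# -> 0.9999 both times:  the 0.5 clue changed nothing.
print(gary_combine(clues), gary_combine(clues + [0.5]))
# -> ~0.99, then ~0.82:  the 0.5 clue dragged the score toward 0.5.
"""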