[Spambayes] Move closer to Gary's ideal
Tim Peters
tim.one@comcast.net
Fri, 20 Sep 2002 23:29:19 -0400
I've checked some new code in for the adventurous. To try it, you can do
"""
[Classifier]
use_robinson_probability: True
use_robinson_combining: True
max_discriminators: 1500
[TestDriver]
spam_cutoff: 0.50
"""
"1500" was my lazy way of spelling infinity; for now, the code uses
math.frexp() to simulate unbounded dynamic float range instead of bothering
with logarithms; this also means the database entries are exactly the same
as they were before. I left max_discriminators working because I suspect
we're going to want it again.
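For the curious, the frexp() trick can be sketched like so; product_frexp
is a made-up helper name for illustration, not the checked-in code:

```python
import math

def product_frexp(probs):
    # Multiply many probabilities while carrying the binary exponent in
    # a separate int, so the running product never underflows to 0.0.
    mantissa, exponent = 1.0, 0
    for p in probs:
        mantissa *= p
        # frexp() splits the running product into m * 2**e with
        # 0.5 <= m < 1.0; folding e into a plain Python int gives an
        # effectively unbounded dynamic range, no logarithms needed.
        mantissa, e = math.frexp(mantissa)
        exponent += e
    return mantissa, exponent  # true product == mantissa * 2**exponent

# A plain product of 1500 tiny spamprobs underflows to 0.0 ...
assert math.prod([1e-3] * 1500) == 0.0
# ... but the frexp() form keeps the full value.
m, e = product_frexp([1e-3] * 1500)
```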
These options have no effect when the above is enabled:
"""
[Classifier]
hambias
spambias
min_spamprob
max_spamprob
"""
I hate all of those, so good riddance if they go <wink>.
Other options you may want to play with, but I don't recommend doing so
unless you've read the source material and think you know what you're doing:
"""
[Classifier]
# This one has no effect for now (it's easy to do, I just haven't gotten
# to it yet).
use_robinson_ranking: False
# The "a" parameter in Gary's prob adjustment.
robinson_probability_a = 1.0
# Likewise the "x" parameter -- it's like our current UNKNOWN_SPAMPROB.
robinson_probability_x = 0.5
"""
I'm still recovering from my corpus screwup and don't have a lot to say
about this yet. Overall it seems to be doing as well as the all-default
scheme (our highly tuned and heavily fiddled Graham scheme)! If it works
better than that, I won't be able to tell from my data (the all-default
scheme was working "too well" for me to demonstrate an improvement if one
were made). I did notice it nail some difficult false negatives I don't
think the minprob/maxprob-hobbled Graham scheme would ever be able to nail.
So all signs are good so far, except maybe one:
There's one surprising/maybe-disturbing thing I've seen on all my little
random-subset runs (which are all I've run so far, interleaved with
re-cleaning my corpus): there's not only "a middle ground" now, it's
essentially ALL "middle ground"! Scores that aren't due to my corpus
pollution are virtually all within 20 points of 50. Here's a typical
histogram pair from a 10-fold c-v run on a random subset of 1000 ham and
1000 spam; the 6 lowest-scoring oddballs in the spam distro were in fact
bogus false negatives due to my corpus screwup (so picture those dots as
belonging in the ham histogram instead):
Ham distribution for all runs:
* = 6 items
0.00 0
2.50 0
5.00 0
7.50 0
10.00 0
12.50 0
15.00 0
17.50 0
20.00 0
22.50 0
25.00 0
27.50 6 *
30.00 35 ******
32.50 98 *****************
35.00 221 *************************************
37.50 307 ****************************************************
40.00 229 ***************************************
42.50 78 *************
45.00 20 ****
47.50 5 *
50.00 1 *
52.50 0
55.00 0
57.50 0
60.00 0
62.50 0
65.00 0
67.50 0
70.00 0
72.50 0
75.00 0
77.50 0
80.00 0
82.50 0
85.00 0
87.50 0
90.00 0
92.50 0
95.00 0
97.50 0
Spam distribution for all runs:
* = 6 items
0.00 0
2.50 0
5.00 0
7.50 0
10.00 0
12.50 0
15.00 0
17.50 0
20.00 0
22.50 0
25.00 0
27.50 0
30.00 0
32.50 1 *
35.00 1 *
37.50 1 *
40.00 1 *
42.50 1 *
45.00 1 *
47.50 1 *
50.00 28 *****
52.50 64 ***********
55.00 184 *******************************
57.50 352 ***********************************************************
60.00 295 **************************************************
62.50 69 ************
65.00 1 *
67.50 0
70.00 0
72.50 0
75.00 0
77.50 0
80.00 0
82.50 0
85.00 0
87.50 0
90.00 0
92.50 0
95.00 0
97.50 0
I have to do other things now, but if anyone wants to play with what I
*would* do if I could <wink>, play with max_discriminators and see whether
reducing that helps spread this out. A suspicion is that folding in endless
quantities of garbage words (spamprob so close to 0.5 that they're not
really clues at all) may be dragging everything toward 0.5 without giving a
real benefit.
The math in Graham's combining scheme is such that a prob 0.5 word has no
effect whatsoever on the outcome. The math in Gary's combining scheme
doesn't appear to have the same property: I believe adding a .5 prob word
there moves the outcome closer to neutrality.
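To make that concrete, here's a toy comparison; this is my own sketch of
the two combining formulas as I understand them, not the checked-in code:

```python
import math

def graham_combine(probs):
    # Graham:  score = prod(p) / (prod(p) + prod(1-p))
    s = math.prod(probs)
    h = math.prod(1.0 - p for p in probs)
    return s / (s + h)

def robinson_combine(probs):
    # Gary:  P = 1 - (prod(1-p))**(1/n),  Q = 1 - (prod(p))**(1/n)
    #        score = (1 + (P-Q)/(P+Q)) / 2
    n = len(probs)
    P = 1.0 - math.prod(1.0 - p for p in probs) ** (1.0 / n)
    Q = 1.0 - math.prod(probs) ** (1.0 / n)
    return (1.0 + (P - Q) / (P + Q)) / 2.0

clues = [0.9, 0.8, 0.7]
# A 0.5-prob word multiplies prod(p) and prod(1-p) by the same 0.5, so
# Graham's ratio is unchanged -- but Gary's score is pulled toward 0.5.
g_before, g_after = graham_combine(clues), graham_combine(clues + [0.5])
r_before, r_after = robinson_combine(clues), robinson_combine(clues + [0.5])
```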