[Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8

Tim Peters tim.one@comcast.net
Sun, 01 Sep 2002 03:04:44 -0400


[Neil Schemenauer]
> ...
> For whatever reason, setting HAMBIAS to 1.0 seems to produce worse
> results.

It's remarkable.  Graham's scheme is pasted together out of all sorts of
things that shouldn't work <wink>, but this one seems the most mysterious.

It has a huge effect in my 5x5 c.l.py test grid.  Combining all unique msgs
identified as false negative or false positive across all 20 test runs, and
comparing against the current HAMBIAS setting:

At HAMBIAS = 1.0
    total false negatives goes down by a factor of 2 (337 -> 166)
    total false positives goes up by a factor of 7.6 (23 -> 174)

and some of the false positives are just amazing -- David Ascher announcing
a Python conference, Laura Creighton pontificating about the GPL, ... it's
hard to fathom!  One innocuous example:

"""
Hello,
        I love all these speed debates but if speed were our only concern we
would all be writing in assembly for all non internet based programs...!

        Thank you,
        Vincent A. Primavera

prob = 0.99918657946
prob('only') = 0.645419
prob('would') = 0.349237
prob('hello,') = 0.342435
prob('assembly') = 0.34891
prob('thank') = 0.819611
prob('these') = 0.677099
prob('all') = 0.709966
prob('you,') = 0.803672
prob('concern') = 0.225352
prob('our') = 0.951928
prob('internet') = 0.942274
prob('speed') = 0.305927
prob('but') = 0.229635
prob('love') = 0.736116
prob('non') = 0.885065
prob('writing') = 0.150994
"""

There's not a lot going on in that msg!  *Perhaps* the primary effect of
boosting HAMBIAS is to take common glue words (like 'these' and 'all') out
of this uniquely "only look at smoking guns" scoring scheme altogether?  I
don't know what "sense" there is in letting 'these' vote in favor of spam,
for example.
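
The score above can be checked by hand.  A minimal sketch, assuming the
combining rule from Graham's essay (the product of the clue probabilities
divided by that product plus the product of their complements), and
assuming the 16 clues listed are exactly the ones that were combined.
graham_combine is a name made up for this sketch, not a function taken
from GBayes.py:

```python
from math import prod  # Python 3.8+

# Clue probabilities copied from the example message above.
clues = {
    'only': 0.645419, 'would': 0.349237, 'hello,': 0.342435,
    'assembly': 0.34891, 'thank': 0.819611, 'these': 0.677099,
    'all': 0.709966, 'you,': 0.803672, 'concern': 0.225352,
    'our': 0.951928, 'internet': 0.942274, 'speed': 0.305927,
    'but': 0.229635, 'love': 0.736116, 'non': 0.885065,
    'writing': 0.150994,
}

def graham_combine(probs):
    """Combine per-word spam probabilities the way Graham's essay does.

    Hypothetical helper: returns P / (P + Q), where P is the product of
    the probabilities and Q is the product of their complements.
    """
    probs = list(probs)
    spam_evidence = prod(probs)
    ham_evidence = prod(1.0 - p for p in probs)
    return spam_evidence / (spam_evidence + ham_evidence)

if __name__ == "__main__":
    print(graham_combine(clues.values()))
```

Under those assumptions the result agrees with the reported prob to about
four decimal places, even though 7 of the 16 clues lean toward ham.  A
couple of strong spam clues ('our', 'internet') plus a handful of mildly
spammy glue words are enough to swamp them.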

At HAMBIAS = 3.0
    total false negatives goes up by a factor of 2.08 (337 -> 702)
    total false positives goes down by a factor of 4.6 (23 -> 5)

Somebody else think about this <wink>.  It's certainly the easiest knob to
twiddle to make a false-positive versus false-negative rate tradeoff.
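
To see how the knob moves individual words, here is a rough sketch.  It
assumes the per-word formula from Graham's essay, p = b / (bias*g + b),
where b and g are the word's relative frequencies in spam and ham and bias
plays the role of HAMBIAS.  It also assumes the probabilities in the
example were computed at HAMBIAS = 1.0 (as the context suggests) and that
none of the clamping in Graham's formula kicked in.  rebias is a
hypothetical helper, not code from GBayes.py:

```python
def rebias(p, old_bias, new_bias):
    """Re-derive a word's spam probability under a different ham bias.

    Assumes p = b / (old_bias*g + b) with no clamping, so the spam-to-ham
    frequency ratio b/g can be recovered from p and reused.
    """
    spam_to_ham = old_bias * p / (1.0 - p)   # recovered b/g ratio
    return spam_to_ham / (new_bias + spam_to_ham)

if __name__ == "__main__":
    # A few clue probabilities copied from the example message.
    listed = {'these': 0.677099, 'all': 0.709966,
              'our': 0.951928, 'internet': 0.942274}
    for word, p in listed.items():
        row = [round(rebias(p, 1.0, bias), 3) for bias in (1.0, 2.0, 3.0)]
        print(word, row)
```

Under these assumptions 'these' drops from 0.677 to about 0.51 at a bias
of 2, which is close to neutral, while a strong clue such as 'our' stays
well above 0.5.  That is at least consistent with the guess that the bias
mostly takes glue words out of the voting.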