... For whatever reason, setting HAMBIAS to 1.0 seems to produce worse results.
It's remarkable. Graham's scheme is pasted together out of all sorts of things that shouldn't work <wink>, but this one seems the most mysterious.
It has a huge effect in my 5x5 c.l.py test grid.  Combining all unique
msgs identified as false negatives or false positives across all 20 test
runs:

At HAMBIAS = 1.0,

    total false negatives go down by a factor of 2.0  (337 -> 166)
    total false positives go up   by a factor of 7.6  (23 -> 174)
and some of the false positives are just amazing -- David Ascher announcing a Python conference, Laura Creighton pontificating about the GPL, ... it's hard to fathom! One innocuous example:
""" Hello, I love all these speed debates but if speed were our only concern we would all be writing in assembly for all non internet based programs...!
Thank you, Vincent A. Primavera
prob = 0.99918657946 prob('only') = 0.645419 prob('would') = 0.349237 prob('hello,') = 0.342435 prob('assembly') = 0.34891 prob('thank') = 0.819611 prob('these') = 0.677099 prob('all') = 0.709966 prob('you,') = 0.803672 prob('concern') = 0.225352 prob('our') = 0.951928 prob('internet') = 0.942274 prob('speed') = 0.305927 prob('but') = 0.229635 prob('love') = 0.736116 prob('non') = 0.885065 prob('writing') = 0.150994 """
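If it helps to see it spelled out, the final score is just Graham's
combining rule applied to the listed word probs: multiply the probs,
multiply their complements, and normalize.  A minimal sketch (the
function name is mine, not classifier.py's):

```python
def graham_combine(probs):
    # Graham's combining rule: P = prod(p) / (prod(p) + prod(1 - p)).
    # spam_prod accumulates prod(p); ham_prod accumulates prod(1 - p).
    spam_prod = ham_prod = 1.0
    for p in probs:
        spam_prod *= p
        ham_prod *= 1.0 - p
    return spam_prod / (spam_prod + ham_prod)

# The 16 word probs from the msg above:
probs = [0.645419, 0.349237, 0.342435, 0.34891, 0.819611, 0.677099,
         0.709966, 0.803672, 0.225352, 0.951928, 0.942274, 0.305927,
         0.229635, 0.736116, 0.885065, 0.150994]

print(graham_combine(probs))  # close to the 0.99918657946 reported above
```

Note the products underflow for long token lists, but that doesn't bite
here since the scheme only ever combines the handful of most extreme
tokens.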
There's not a lot going on in that msg! *Perhaps* the primary effect of boosting HAMBIAS is to take common glue words (like 'these' and 'all') out of this uniquely "only look at smoking guns" scoring scheme altogether? I don't know what "sense" there is in letting 'these' vote in favor of spam, for example.
At HAMBIAS = 3.0,

    total false negatives go up   by a factor of 2.08 (337 -> 702)
    total false positives go down by a factor of 4.6  (23 -> 5)
Somebody else think about this <wink>. It's certainly the easiest knob to twiddle to make a false-positive versus false-negative rate tradeoff.