[Python-Dev] The first trustworthy <wink> GBayes results

Tim Peters tim.one@comcast.net
Wed, 04 Sep 2002 20:52:14 -0400


[Tim]
> ...
> The first 16 most extreme indicators are split 9 highly in favor of ham
> (.01) and 7 highly in favor of spam (.99).  If I hadn't folded
> case away to let stinking conference announcements through <wink>, I
> expect it would have latched on to the SCREAMING at the start instead of
> looking deeper.  Looking at the To: line probably would nail this one too,
> as "Undisclosed Recipients" has two 0.99 spam indicators right there.
>
> Whatever, you *don't* want to look at msgs with a mix of just
> 0.99 and 0.01 thingies:  it's not all that unusual to get such an
> extreme mix, in spam or ham.

I should have added that it usually gets the right result when this happens.
It's the exceptions to that rule that are mondo embarrassing, because it's
making a mistake then while sitting on a mountain of strong evidence (albeit
pointing as extremely as possible in both directions at once <wink>).

"A problem" is that when a MIN_SPAMPROB and MAX_SPAMPROB clue both appear,
the math is such that they cancel out exactly.  It's *almost* as if neither
existed, but not quite:  they also keep two lower-probability words *out* of
the computation (only a grand total of the MAX_DISCRIMINATORS most extreme
clues are retained).
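
To see why the pair cancels, here is a minimal sketch. It assumes the Graham-style combining rule P/(P+Q), where P is the product of the clue probabilities and Q is the product of their complements; the post doesn't spell the formula out, so treat that as an assumption. The 0.01 and 0.99 values are the ones quoted above, and combine() is a hypothetical helper, not the real spamprob().

```python
MIN_SPAMPROB = 0.01  # strongest possible ham clue (value quoted in the thread)
MAX_SPAMPROB = 0.99  # strongest possible spam clue (value quoted in the thread)

def combine(probs):
    """Assumed Graham-style rule: P / (P + Q), with P = prod(p), Q = prod(1 - p)."""
    p_spam = 1.0
    p_ham = 1.0
    for p in probs:
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)

# Three made-up clue probabilities, purely for illustration.
clues = [0.20, 0.90, 0.95]

# Adding one MIN and one MAX clue multiplies both P and Q by 0.01 * 0.99,
# so the ratio is unchanged (up to floating-point rounding).
with_pair = clues + [MIN_SPAMPROB, MAX_SPAMPROB]

print(combine(clues))
print(combine(with_pair))

# On its own, the pair says nothing at all: the score is a coin flip.
print(combine([MIN_SPAMPROB, MAX_SPAMPROB]))
```

The score is unaffected, but under a fixed clue limit the pair still takes up two slots that weaker, non-cancelling clues could have used.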

So I changed spamprob() to keep accepting more clues when MIN/MAX
cancellations are inevitable, and to use the best of those in lieu of the
cancelling extremes.  This turned out to be a pure win:
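
A rough sketch of that kind of selection follows. It is not the actual spamprob() code: select_clues(), naive_select(), and combine() are hypothetical names, the limit of 16 is taken from the "16 most extreme indicators" quoted above, the combining rule is the assumed Graham-style P/(P+Q), and probabilities are assumed to be already clamped to [MIN_SPAMPROB, MAX_SPAMPROB].

```python
MIN_SPAMPROB = 0.01
MAX_SPAMPROB = 0.99
MAX_DISCRIMINATORS = 16  # assumed from the "16 most extreme indicators" above

def combine(probs):
    """Assumed Graham-style rule: P / (P + Q)."""
    p_spam = 1.0
    p_ham = 1.0
    for p in probs:
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)

def naive_select(probs, limit=MAX_DISCRIMINATORS):
    """Old behavior (sketch): just keep the `limit` most extreme clues."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:limit]

def select_clues(probs, limit=MAX_DISCRIMINATORS):
    """New behavior (sketch): drop cancelling MIN/MAX pairs, then fill the
    freed slots with the best of the remaining clues."""
    mins = [p for p in probs if p == MIN_SPAMPROB]
    maxs = [p for p in probs if p == MAX_SPAMPROB]
    others = [p for p in probs if MIN_SPAMPROB < p < MAX_SPAMPROB]

    # Each MIN clue cancels one MAX clue; at most one kind survives.
    cancel = min(len(mins), len(maxs))
    survivors = mins[cancel:] + maxs[cancel:]

    # Surviving extremes are the strongest clues, so they go first.
    others.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return (survivors + others)[:limit]

# Made-up message: 8 extreme ham clues, 8 extreme spam clues, 3 milder clues.
probs = [MIN_SPAMPROB] * 8 + [MAX_SPAMPROB] * 8 + [0.90, 0.20, 0.60]

print(combine(naive_select(probs)))   # extremes fill every slot and cancel
print(select_clues(probs))            # the milder clues get a say instead
print(combine(select_clues(probs)))
```

In this made-up case the naive selection scores the message at essentially 0.5, while the cancellation-aware selection lets the three milder clues decide.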

false positive percentages (before, after the change)
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.050  0.050  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.075  0.075  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.075  0.025  won
    0.025  0.025  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.050  0.050  tied

won   1 times
tied 19 times
lost  0 times

total unique fp went from 9 to 7

false negative percentages (before, after the change)
    0.909  0.764  won
    0.800  0.691  won
    1.091  0.981  won
    1.381  1.309  won
    1.491  1.418  won
    1.055  0.873  won
    0.945  0.800  won
    1.236  1.163  won
    1.564  1.491  won
    1.200  1.200  tied
    1.454  1.381  won
    1.599  1.454  won
    1.236  1.164  won
    0.800  0.655  won
    0.836  0.655  won
    1.236  1.163  won
    1.236  1.200  won
    1.055  0.982  won
    1.127  0.982  won
    1.381  1.236  won

won  19 times
tied  1 times
lost  0 times

total unique fn went from 284 to 260