[Python-Dev] The first trustworthy <wink> GBayes results
Tim Peters
tim.one@comcast.net
Wed, 04 Sep 2002 20:52:14 -0400
[Tim]
> ...
> The first 16 most extreme indicators are split 9 highly in favor of ham
> (.01) and 7 highly in favor of spam (.99). If I hadn't folded
> case away to let stinking conference announcements through <wink>, I
> expect it would have latched on to the SCREAMING at the start instead of
> looking deeper. Looking at the To: line probably would nail this one too,
> as "Undisclosed Recipients" has two 0.99 spam indicators right there.
>
> Whatever, you *don't* want to look at msgs with a mix of just
> 0.99 and 0.01 thingies: it's not all that unusual to get such an
> extreme mix, in spam or ham.
I should have added that it usually gets the right result when this happens.
It's the exceptions to that rule that are mondo embarrassing, because it's
making a mistake then while sitting on a mountain of strong evidence (albeit
pointing as extremely as possible in both directions at once <wink>).
"A problem" is that when a MIN_SPAMPROB and MAX_SPAMPROB clue both appear,
the math is such that they cancel out exactly. It's *almost* as if neither
existed, but not quite: they also keep two lower-probability words *out* of
the computation (only a grand total of the MAX_DISCRIMINATORS most extreme
clues are retained).
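To make the cancellation concrete, here is a minimal sketch. It assumes the clues are combined Graham-style, as P/(P+Q) with P the product of the clue probabilities and Q the product of their complements; the post does not spell the formula out, so treat that as an assumption. The function name combine() and the three "other" probabilities are made up for illustration. The .01 and .99 values are the ones quoted above.

```python
from math import prod  # Python 3.8+

MIN_SPAMPROB = 0.01  # value quoted in the post
MAX_SPAMPROB = 0.99  # value quoted in the post


def combine(probs):
    """Assumed Graham-style combination: P / (P + Q)."""
    p = prod(probs)                    # product of the spam probabilities
    q = prod(1.0 - x for x in probs)   # product of their complements
    return p / (p + q)


# Hypothetical mid-range clues, for illustration only.
others = [0.90, 0.20, 0.75]

with_pair = combine(others + [MIN_SPAMPROB, MAX_SPAMPROB])
without_pair = combine(others)

# The MIN/MAX pair multiplies both P and Q by .01 * .99, so it drops out
# of the ratio (up to floating-point rounding).
print(with_pair, without_pair)
```

The pair contributes nothing to the score, yet in a scheme that keeps only a fixed number of clues it still occupies two of the slots.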
So I changed spamprob() to keep accepting more clues when MIN/MAX
cancellations are inevitable, and to use the best of those in lieu of the
cancelling extremes. This turned out to be a pure win:
false positive percentages
    before  after
    0.000   0.000   tied
    0.000   0.000   tied
    0.050   0.050   tied
    0.000   0.000   tied
    0.025   0.025   tied
    0.025   0.025   tied
    0.050   0.050   tied
    0.025   0.025   tied
    0.025   0.025   tied
    0.025   0.025   tied
    0.075   0.075   tied
    0.025   0.025   tied
    0.025   0.025   tied
    0.025   0.025   tied
    0.075   0.025   won
    0.025   0.025   tied
    0.025   0.025   tied
    0.000   0.000   tied
    0.025   0.025   tied
    0.050   0.050   tied

won   1 times
tied 19 times
lost  0 times

total unique fp went from 9 to 7

false negative percentages
    before  after
    0.909   0.764   won
    0.800   0.691   won
    1.091   0.981   won
    1.381   1.309   won
    1.491   1.418   won
    1.055   0.873   won
    0.945   0.800   won
    1.236   1.163   won
    1.564   1.491   won
    1.200   1.200   tied
    1.454   1.381   won
    1.599   1.454   won
    1.236   1.164   won
    0.800   0.655   won
    0.836   0.655   won
    1.236   1.163   won
    1.236   1.200   won
    1.055   0.982   won
    1.127   0.982   won
    1.381   1.236   won

won  19 times
tied  1 times
lost  0 times

total unique fn went from 284 to 260
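
For readers who want to see the shape of the spamprob() change described above, here is a rough sketch of one way to do the clue selection. It is not the actual code: the function name select_clues(), the limit parameter, and the exact tie-handling are assumptions, and MAX_DISCRIMINATORS = 16 is inferred from the "16 most extreme indicators" mentioned in the quoted text. The idea is to pair off MIN_SPAMPROB and MAX_SPAMPROB clues, since each pair cancels, and fill the freed slots with the next most extreme clues.

```python
MIN_SPAMPROB = 0.01      # value quoted in the post
MAX_SPAMPROB = 0.99      # value quoted in the post
MAX_DISCRIMINATORS = 16  # assumption, inferred from the quoted "16"


def select_clues(probs, limit=MAX_DISCRIMINATORS):
    """Hypothetical selection: drop cancelling MIN/MAX pairs, then keep
    the `limit` clues farthest from the neutral value 0.5."""
    mins = [p for p in probs if p <= MIN_SPAMPROB]
    maxs = [p for p in probs if p >= MAX_SPAMPROB]
    rest = [p for p in probs if MIN_SPAMPROB < p < MAX_SPAMPROB]

    # Each MIN clue cancels one MAX clue; only the surplus survives.
    ncancel = min(len(mins), len(maxs))
    survivors = mins[ncancel:] + maxs[ncancel:] + rest

    # Most extreme first, i.e. largest distance from 0.5.
    survivors.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return survivors[:limit]


# With room for only two clues, the cancelling pair no longer crowds out
# the weaker clues that actually carry information.
print(select_clues([0.01, 0.99, 0.8, 0.3, 0.6], limit=2))
```

A plain "take the most extreme clues" rule would have kept just the 0.01 and 0.99 entries in that example and ended up with no net evidence at all.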