[Python-Dev] The first trustworthy <wink> GBayes results
Tim Peters
tim.one@comcast.net
Tue, 03 Sep 2002 15:08:57 -0400
[Charles Cazabon]
> I would love to see how the results would be affected by applying
> the scoring scheme to the entire content of the message, instead of
> just the 15 (or 16 in your case) most extreme samples.
Then it would be close to a classic Bayesian classifier, and like any such
would need entirely different scoring code to avoid catastrophic
floating-point errors (right now an intermediate result can't become smaller
than 0.01**16 = 1e-32, so fp troubles are impossible; raise the exponent to
a measly 200 and you're already out of the range of IEEE double precision;
classic classifiers work in logarithm space instead for this reason). You
can read lots of papers on how those do; all evidence suggests they do worse
than this scheme on the spam versus non-spam task.
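A quick illustration of the underflow point above (a sketch, not the project's scoring code): multiplying 200 probabilities of 0.01 silently underflows an IEEE double to 0.0, while summing their logs stays perfectly stable.

```python
import math

# 0.01**16 == 1e-32 is comfortably representable, but 0.01**200 == 1e-400
# is below the smallest subnormal double (~5e-324), so the running product
# silently underflows to 0.0.
probs = [0.01] * 200
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- underflow

# Working in log space instead keeps the computation well within range.
log_product = sum(math.log(p) for p in probs)
print(log_product)  # about -921.03, i.e. log of 1e-400
```

This is why a classifier that scores *every* token in a message has to carry its intermediate results as sums of logs rather than products of probabilities.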
> By the way, you never said why you increased that number by one;
It's explained in the comment block preceding the MAX_DISCRIMINATORS
definition.
BTW, in an unreported experiment I boosted MAX_DISCRIMINATORS to 36. I
don't recall what happened now, but it was a disaster for at least one of
the error rates.
> did it make that much difference?
Not on average. It helped eliminate a narrow class of false positives,
where previously the first 15 extremes the classifier saw had 8 probs of .99
and 7 of .01. That works out to "spam". Making the # of discriminators even
instead allowed for graceful ties, which favor ham in this scheme. All
previous decisions "should be" revisited after each new change, though, and
in this particular case it could well be that stripping HTML tags out of
plain-text messages also addressed the same narrow issue but in a more
effective way (without some special gimmick, virtually every message
including so much as an example of HTML got scored as spam).
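To see the arithmetic behind that narrow class of false positives, here's a sketch (assuming Graham-style combining, prod(p) / (prod(p) + prod(1-p)); not the project's actual code) of why 8 extremes at .99 against 7 at .01 scores as spam, while a balanced 8-vs-8 split yields a graceful tie:

```python
def combine(probs):
    """Combine word probabilities Graham-style: p/(p + q), where
    p is the product of the probs and q the product of their complements."""
    p = q = 1.0
    for x in probs:
        p *= x
        q *= 1.0 - x
    return p / (p + q)

odd = [0.99] * 8 + [0.01] * 7    # 15 discriminators: one side always wins
even = [0.99] * 8 + [0.01] * 8   # 16 discriminators: a tie is possible

print(round(combine(odd), 3))   # 0.99 -> scored as spam
print(round(combine(even), 3))  # 0.5  -> a tie, which favors ham
```

With 15 extremes, the seven .01s and seven of the .99s cancel, and the one leftover .99 drags the whole score to 0.99; an even count lets the two camps cancel completely.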