[Spambayes] RE: Central Limit Theorem??!! :)

Tim Peters tim.one@comcast.net
Fri, 27 Sep 2002 13:13:29 -0400


[Neil Schemenauer]
> I don't have anything useful to add except to say that this looks very
> promising.

So far, yes!  The logarithmic variation of Gary's scheme (which I was using)
computes both a ham and a spam statistic from the word probs, and that seems
akin to what we speculated about earlier:  combining a "no false positives"
scoring gimmick with a "no false negatives" scoring gimmick to grope at some
notion of confidence.

> I'm an idiot when it comes to statistics but even I know
> what 18 standard deviations means.

It means 1 chance in about 10**72.  I've forgotten how many electrons there
are in the universe, although I've long suspected there's really only one
<wink>.
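
For the record, that figure is just the one-sided normal tail probability
of an 18-sigma event.  If your Python's math module has erfc() -- recent
ones do -- checking it is a one-liner:

    from math import erfc, sqrt

    # P(Z > 18) for a standard normal Z.
    print(erfc(18.0 / sqrt(2)) / 2)   # ~ 1e-72:  1 chance in about 10**72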

> Also, having a system that generates a "confidence" value in addition
> to a "rating" is a huge bonus.
>
> No time to play with it now, unfortunately.  Maybe tomorrow.

If you can make some time, here are the options to use:

"""
[Classifier]
use_robinson_probability: True
robinson_probability_x: 0.5
robinson_probability_a: 0.225
max_discriminators: 30
robinson_minimum_prob_strength: 0.1
use_central_limit2: True

[TestDriver]
spam_cutoff: 0.5
"""

The relevant routines in classifier.py are

    central_limit_compute_population_stats2()
    central_limit_spamprob2()

The latter replaces the default spamprob() (enabling use_central_limit2 does
that by magic).  The statistic it computes at the end (the "score" it
returns):

        stat = abs(zham) - abs(zspam)  # > 0 for spam, < 0 for ham

        if stat < -20.0:
            stat = -20.0
        elif stat > 20.0:
            stat = 20.0
        stat = 0.5 + stat / 40.0

is simply the difference between the z-score magnitudes, clamped and scaled
in an absurd way to fit in [0., 1.].  These scores make no sense, though,
*beyond* mapping - and + to < 0.5 and > 0.5 respectively.

This scheme can't be used with timcv (yet), but can be used with timtest.

Things that need investigating:

+ How to compute a sensible score?
  An honest-to-God probability can be computed from a z-score, and
  I can supply code to do that (although I'd search Google first
  for Python code that already does it; a simple search over a
  canned sorted list of (abs(zscore), probability) tuples is the
  quickest way to get a good idea).  See the first sketch after
  this list.

+ How to represent, compute, and return a measure of confidence?

+ Once that's done, how well does it work?  Does it ever err when
  it's confident?  Since the answer is "yes" <wink>, what are the
  error rates in the region of confidence?  What fraction of the
  ham and spam is it uncertain about?  If that's high, it may not
  be usable.  In my quick tests, it's possible to deduce from the
  absurdly scaled result (see the second sketch after this list)
  that it was extremely confident about 3932 of 4000 ham, and made
  no errors on those.  It's also possible to deduce that it was
  extremely confident about 2401 of 2800 spam, and made no errors
  on those.  Overall, it got 0 fp and 9 fn from looking at just
  the sign bit.  Both z-scores are "large" in the cases it erred,
  although it gets to be a delicate decision when the number of
  words is small.  For example, one fn here had only 8 tokens, and
  had z-scores of -3.8 and -5.2.  Neither is astronomically
  unlikely, but both are "quite unlikely".  OTOH, plenty of ham
  and spam have one "quite unlikely" and one "astronomically
  unlikely" z-score each, and it's probably best to be quite
  confident about those.  I simply don't know a good way to
  combine this evidence yet.

+ After a sensible way of computing scores is worked out:
  - What's a better value for robinson_minimum_prob_strength?
    Gary would be happiest if it turned out to be 0 <wink>.
  - What's a better value for max_discriminators?  30 is just an
    educated guess at the minimum that's almost certain to be
    robust.  It may be robust at smaller values.  It may also get
    real benefit from sucking in more clues (but then shorter
    msgs may become more of a puzzle too).  The third snippet
    after this list shows the kind of config tweak to try.
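
Here's the kind of thing I have in mind for the score question.  A minimal
sketch, assuming a math module with erfc() (modern Pythons have one; if
yours doesn't, the canned-table lookup below works anywhere -- its few
entries are standard two-sided normal tail values, shown only for
illustration):

    from math import erfc, sqrt
    import bisect

    def zscore_prob(z):
        # Two-sided probability of seeing a z-score at least this
        # extreme under a standard normal distribution:
        #     P(|Z| >= |z|) = erfc(|z| / sqrt(2))
        return erfc(abs(z) / sqrt(2))

    # The canned-table alternative:  binary search over a sorted
    # list of (abs(zscore), probability) tuples.  A real table
    # would be much denser than this.
    _ZTABLE = [(0.0, 1.0),
               (1.0, 0.3173),
               (2.0, 0.0455),
               (3.0, 0.0027),
               (4.0, 6.3e-5),
               (5.0, 5.7e-7)]
    _ZKEYS = [z for z, p in _ZTABLE]

    def zscore_prob_lookup(z):
        # Use the probability for the largest tabulated z-score not
        # exceeding abs(z); that overestimates the true tail, which
        # errs on the cautious side.
        i = bisect.bisect_right(_ZKEYS, abs(z)) - 1
        return _ZTABLE[i][1]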
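
One way to read the "deduce from the absurdly scaled result" bit:  because
of the clamping, a returned score can be exactly 0.0 or 1.0 only if
abs(zham) - abs(zspam) hit the -20/+20 wall, and those are the messages
the scheme is most confident about.  A sketch:

    def is_extremely_confident(score):
        # central_limit_spamprob2() computes
        #     stat = abs(zham) - abs(zspam)
        # clamps stat to [-20., 20.], then returns 0.5 + stat/40.
        # So the score pins at exactly 0.0 or 1.0 if and only if
        # the gap between the z-score magnitudes was at least 20.
        return score <= 0.0 or score >= 1.0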
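
Finally, for the tuning questions:  the knobs live in the same config
section shown above.  For example, a run at Gary's preferred minimum
strength, with room for more clues, would look like this (the 50 is
purely illustrative):

"""
[Classifier]
robinson_minimum_prob_strength: 0.0
max_discriminators: 50
"""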