[Spambayes] RE: Central Limit Theorem??!! :)
Tim Peters
tim.one@comcast.net
Fri, 27 Sep 2002 13:13:29 -0400
[Neil Schemenauer]
> I don't have anything useful to add except to say that this looks very
> promising.
So far, yes! The logarithmic variation of Gary's scheme (which I was using)
computes both a ham and a spam statistic from the word probs, and this seems
akin to the gimmick we speculated about earlier for using two "no false
positives" and "no false negatives" scoring gimmicks to grope at some notion
of confidence.
> I'm an idiot when it comes to statistics but even I know
> what 18 standard deviations means.
It means 1 chance in about 10**72. I've forgotten how many electrons there
are in the universe, although I've long suspected there's really only one
<wink>.
> Also, having a system that generates a "confidence" value in addition
> to a "rating" is a huge bonus.
>
> No time to play with it now, unfortunately. Maybe tomorrow.
If you can make some time, the configuration to use is:
"""
[Classifier]
use_robinson_probability: True
robinson_probability_x: 0.5
robinson_probability_a: 0.225
max_discriminators: 30
robinson_minimum_prob_strength: 0.1
use_central_limit2: True
[TestDriver]
spam_cutoff: 0.5
"""
The relevant routines in classifier.py are
central_limit_compute_population_stats2()
central_limit_spamprob2()
The latter replaces the default spamprob() (enabling use_central_limit2 does
that by magic). The statistic it computes at the end (the "score" it
returns):
    stat = abs(zham) - abs(zspam)   # > 0 for spam, < 0 for ham
    if stat < -20.0:
        stat = -20.0
    elif stat > 20.0:
        stat = 20.0
    stat = 0.5 + stat / 40.0
is simply looking at the sign bit of the difference between the z-scores,
scaling it in an absurd way to fit in [0., 1.]. These scores don't make any
sense, though, *beyond* mapping - and + to < 0.5 and > 0.5 respectively.
This scheme can't be used with timcv (yet), but can be used with timtest.
Things that need investigating:
+ How to compute a sensible score?
An honest-to-God probability can be computed from a z-score, and
I can supply code to do that (although I'd look on google first
for Python code that can already do that; a simple search over a
canned sorted list of (abs(zscore), probability) tuples is the
quickest way to get a good idea).
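For the record, on a modern Python no canned table is needed: math.erfc gives the one-sided normal tail probability directly (math.erfc didn't exist in the Python of the day, so treat this as a sketch of the idea rather than of what the project could ship):

```python
import math

def zscore_tail_prob(z):
    # One-sided tail probability P(Z > z) for a standard normal.
    # erfc stays accurate far out in the tail, where computing
    # 1 - Phi(z) naively would underflow to exactly 0.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# The 18-standard-deviations case mentioned earlier really is
# roughly 1 chance in 10**72:
print(zscore_tail_prob(18.0))   # about 9.7e-73
```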
+ How to represent, compute, and return a measure of confidence?
+ Once that's done, how well does it work? Does it ever err when
it's confident? Since the answer is "yes" <wink>, what are the
error rates in the region of confidence? What rates of ham and
spam is it uncertain about? If that's high, it may not be usable.
In my quick tests, it's possible to deduce from the absurdly
scaled result that it was extremely confident about 3932 of 4000 ham
and made no errors on those. Also possible to deduce it was
extremely confident about 2401 of 2800 spam, and made no errors
on those. Overall, it got 0 fp and 9 fn from looking at just
the sign bit. The z-scores are both "large" in the cases it
erred, although it gets to be a delicate decision when the number
of words is small. For example, one fn here had only 8 tokens,
and had z-scores of -3.8 and -5.2. Neither is astronomically
unlikely, but both are "quite unlikely". OTOH, plenty of ham
and spam have one "quite unlikely" and one "astronomically
unlikely" z-score each, and it's probably best to be quite
confident about those. I simply don't know a good way to combine
this evidence yet.
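One textbook candidate for combining the two tail probabilities -- emphatically not what the current code does, just an illustration of the kind of combiner that might resolve the "two quite-unlikely z-scores" cases -- is Fisher's method, which has a closed form for exactly two p-values:

```python
import math

def fisher_combined_prob(p1, p2):
    # Fisher's method for two p-values: under the null hypothesis,
    # x = -2*(ln p1 + ln p2) is chi-squared with 4 degrees of
    # freedom, and for 4 df the survival function is exactly
    # exp(-x/2) * (1 + x/2).
    x = -2.0 * (math.log(p1) + math.log(p2))
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

# Two merely "quite unlikely" scores reinforce each other into
# something much stronger than either alone:
print(fisher_combined_prob(1e-4, 1e-4))   # far below 1e-4
```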
+ After a sensible way of computing scores is worked out:
- What's a better value for robinson_minimum_prob_strength?
Gary would be happiest if it turned out to be 0 <wink>.
- What's a better value for max_discriminators? 30 is just an
educated guess at the minimum that's almost certain to be
robust. It may be robust at smaller values. It may also get
real benefit from sucking in more clues (but then shorter
msgs may become more of a puzzle too).