[Spambayes] RE: For the bold

Tim Peters tim.one@comcast.net
Sun, 06 Oct 2002 02:14:15 -0400


[Rob Hooft]
>...
> Appended is a pdf containing six histograms made using
> max_discriminators=55
>
> The first one is zham for all ham messages. As you can see, the
> distribution is asymmetric. Furthermore, a simple average and standard
> deviation calculation results in a bell curve that does not follow the
> important tail of the histogram: the chances will be severely
> underestimated by these parameters.

Two things.  First, the raw spam score (smean) of a msg is the natural log
of the geometric mean of the extreme-word spamprobs.  This statistic can
never be positive, has no theoretical bound on how low it can go, and is
typically a small negative number, around -0.12.  It's simply impossible to
get a raw score "much larger" (much more positive) than that, but easy to
get one much smaller (much more negative), so I think the asymmetry is
inevitable.

The raw ham score (hmean) is similar, but uses the log of the geometric mean
of 1-prob, and is typically farther away from 0.0, nearer -0.33.  That gives
more room for larger scores to exist (remember that it can never be
positive!), and I expect that's why the first stab at fitting a bell curve
to the ham worked better than for the spam, despite that both were poor
fits.
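As a minimal sketch of those two raw scores (the extreme-word spamprobs below are invented for illustration; this is not the classifier's actual code):

```python
import math

def log_geometric_mean(probs):
    # ln of the geometric mean: (1/n) * sum(ln p).  Every p is in
    # (0, 1], so each ln(p) <= 0 and the result can never be positive,
    # but it has no lower bound as any p approaches 0.
    return sum(math.log(p) for p in probs) / len(probs)

# Hypothetical extreme-word spamprobs for one spammy message:
probs = [0.99, 0.95, 0.90, 0.80, 0.70]

smean = log_geometric_mean(probs)                     # raw spam score
hmean = log_geometric_mean([1.0 - p for p in probs])  # raw ham score
```

For a spammy message the spamprobs sit near 1, so smean stays close to 0 while hmean, built from the small 1-prob values, gets driven far more negative.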

All this may well be why the original use_central_limit scheme (which uses
the straight mean of the word spamprobs -- no logs, no geometric means, no
two-way prob vs 1-prob scoring) worked better for me under your scheme in my
tests:  that's got no fundamental reason (as far as I can see) to be *so*
lopsided; indeed, the mean and median of hmean are very close under
use_central_limit, and likewise for smean.  This isn't true under the other
central limit schemes.  They're still lopsided, though; here from an
original central limit run:

ham ham mean: 6000 items; mean 0.18; sdev 0.09
-> <stat> min 0.00620435; median 0.183251; max 0.840666

spam spam mean: 6000 items; mean 0.93; sdev 0.07
-> <stat> min 0.486362; median 0.950825; max 0.996632
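For concreteness, a one-line sketch of the straight-mean raw score that use_central_limit computes (the word probs are invented):

```python
def straight_mean(probs):
    # use_central_limit's raw score: the plain average of the word
    # spamprobs -- no logs, no geometric mean -- so it's necessarily
    # bounded to [0.0, 1.0].
    return sum(probs) / len(probs)

# Hypothetical word spamprobs for one message:
score = straight_mean([0.9, 0.2, 0.6])
```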

The ham mean can't get below 0 under that scheme, and 0 is just two sdevs
away from the ham-mean mean ~= the ham-mean median.

The spam mean can't get above 1.0 under that scheme, and 1.0 is just one
sdev removed from the spam-mean mean ~= (but less so) the spam-mean median.
So here again, fitting a bell curve to the ham is easier than fitting one
to the spam.
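Plugging the run's numbers in shows how hard each hard bound squeezes its fitted curve:

```python
# Distance, in sdevs, from each distribution's mean to its hard bound,
# using the figures from the central limit run quoted above.
ham_mean, ham_sdev = 0.18, 0.09
spam_mean, spam_sdev = 0.93, 0.07

ham_bound_sdevs = (ham_mean - 0.0) / ham_sdev     # ham can't go below 0
spam_bound_sdevs = (1.0 - spam_mean) / spam_sdev  # spam can't exceed 1
# ham_bound_sdevs ~= 2, spam_bound_sdevs ~= 1: the spam curve is
# pressed against its bound twice as hard as the ham curve.
```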


Second, there's no real justification for the way zscores are computed in
the classifier code now.  You may get better results if you ignore the
zscores in the pickle, and work directly with the raw hmean and smean scores
instead (which are also in the binary pickle saved by clgen).  They're the
actual data here, and the zscores are a distorted version that factor in n
(the number of extreme words) in a way that doesn't make real sense.  Note
that n is also in the clgen pickle tuples:  all the relevant info is there,
except for the individual word probabilities used.
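A hedged sketch of what working directly with the raw scores could look like -- the smean values below are invented stand-ins for what clgen pickles, and mean_sdev is a helper written here, not something from the codebase:

```python
import math

def mean_sdev(xs):
    # Plain sample mean and standard deviation.
    n = len(xs)
    mean = sum(xs) / n
    sdev = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return mean, sdev

# Invented raw smean values, as if read back from clgen's pickle:
smeans = [-0.10, -0.12, -0.15, -0.11, -0.13]

mu, sigma = mean_sdev(smeans)
# A z-like distance computed from the raw scores themselves, with no
# n-dependent distortion folded in:
z = (smeans[0] - mu) / sigma
```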

> The second one is abs(zham) for all ham messages. The bell curve fits
> this histogram much better!

Since use_central_limit2 and use_central_limit3 produce inherently and
highly lopsided distributions, I think that makes good sense.