[Spambayes] RE: Central Limit Theorem??!! :)
Gary Robinson
grobinson@transpose.com
Mon, 23 Sep 2002 08:46:10 -0400
OK. That's fascinating.
Remember that the multiplicative method in S, which calculates the geometric
mean of the f(w)'s, stresses the MOST extreme values more than the less
extreme ones. The more extreme the value is, the more it is stressed. Very
extreme ones are stressed very highly, in an exponentially compounding way
if there are several really extreme ones.
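As a toy illustration of that stressing effect (the numbers here are invented purely for the example, not taken from any real message), compare how the arithmetic and geometric means react when one f(w) becomes very extreme:

```python
import math

# Hypothetical spam-probability estimates for five words; the last one
# is extreme in the spammy direction.  Values are made up.
fws = [0.6, 0.55, 0.6, 0.65, 0.999]
arith = sum(fws) / len(fws)
geo = math.exp(sum(math.log(f) for f in fws) / len(fws))

# Same words, but now the extreme value points the other way
# (very strong ham evidence).
fws2 = [0.6, 0.55, 0.6, 0.65, 0.001]
arith2 = sum(fws2) / len(fws2)
geo2 = math.exp(sum(math.log(f) for f in fws2) / len(fws2))

# The arithmetic mean moves only modestly between the two cases,
# while the geometric mean is dragged far down by the single 0.001.
print(arith, geo)
print(arith2, geo2)
```

The single extreme value dominates the geometric mean in a way it never can with the arithmetic mean, which is the multiplicative stressing described above.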
That's why it's the basis for the 1971 optimality theorem that I kept
trying to invoke more strongly, but which is still invoked to a degree in S.
That's the reasoning behind S in the first place, and why it works so well.
It also happens in Graham's original approach (but only on one side), but we
are completely losing it in R.
In R, we are trading that very powerful multiplicative effect away in order
to get the benefit of real parametric statistics. ALSO a very powerful
technique but apparently slightly less powerful in this application -- at
least when used alone.
If there is a performance loss in R (and there are no remaining coding
errors), I am confident that that's why.
THERE IS A POTENTIAL FIX FOR THIS LOSSAGE, so that we can theoretically get
the best of both totally different techniques.
When training on the spam side, don't use f(w), use: ln (1-f(w)).
When training on the ham side, don't use f(w), use: ln f(w).
Same when testing. Don't add the f(w)'s in creating the sample mean; add the
expressions above, and divide by n. So the spam side uses ln (1-f(w)) both
for training and calculating the sample means, and the ham side uses ln f(w)
for both.
As we've discussed, averaging the ln's is the same thing as a geometric mean
if you then subsequently raise e to the power of that computed average. But
we don't do that last step here. We just feed the arithmetic mean of the
ln's into the z-score calcs.
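A minimal sketch of the substitution (function names, and the f(w) values, are mine for illustration, not taken from the actual spambayes code):

```python
import math

def ham_term(fw):
    # ham side trains and scores on ln f(w)
    return math.log(fw)

def spam_term(fw):
    # spam side trains and scores on ln (1 - f(w))
    return math.log(1.0 - fw)

def sample_mean(fws, term):
    # Arithmetic mean of the transformed values.  This equals the log of
    # the geometric mean, but we never raise e to it -- this value is fed
    # straight into the z-score calculations.
    return sum(term(f) for f in fws) / len(fws)

# e.g. for a message's n most extreme words (made-up, mostly hammy values):
fws = [0.05, 0.1, 0.2, 0.9]
ham_mean = sample_mean(fws, ham_term)
spam_mean = sample_mean(fws, spam_term)
```

At scoring time the same quantities are computed for the message and compared, via z-scores, against the means and variances of those quantities collected during training.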
This *should* bring us the benefits of the multiplicative approach and of
the parametric stats approach at the same time.
The downside of this is that the ln's make such a skewed distribution that
it should take a bigger n to make the central limit theorem kick in. BUT,
OTOH, it WILL kick in, and something like 30 may really still be enough (it
usually kicks in at significantly smaller numbers). And you've also
successfully used n=150 with f(w) and that is DEFINITELY enough.
The other downside is that it just seems like a bit of a wild thing to do
and sometimes when you do wild things, strange reasons emerge why they won't
work. But I really can't see any at this point as long as n is big enough
that the sample means take on a normal distribution.
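The skewness worry is easy to check empirically. Here is a quick Monte Carlo sketch; the f(w) distribution below is chosen arbitrarily for illustration (real f(w)'s come from training data), but it shows the raw ln values being strongly skewed while their n=30 sample means come out much closer to symmetric:

```python
import math
import random

random.seed(12345)  # fixed seed so the run is reproducible

def draw_fw():
    # Skewed stand-in for an f(w) distribution, clamped away from 0 and 1
    # so the logs stay finite.  Purely illustrative.
    return min(max(random.betavariate(0.5, 2.0), 1e-6), 1 - 1e-6)

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 3 for x in xs) / n / var ** 1.5

# Raw ln f(w) values: a long left tail, heavily skewed.
raw = [math.log(draw_fw()) for _ in range(30000)]

# Sample means of n=30 ln's, as when scoring one message.
means = [sum(math.log(draw_fw()) for _ in range(30)) / 30
         for _ in range(1000)]

print(skewness(raw))    # strongly negative
print(skewness(means))  # much closer to 0: the CLT kicking in
```

The sample means' skewness shrinks roughly as 1/sqrt(n), so even a heavily skewed ln distribution settles down quickly once n is in the 30-150 range discussed above.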
THANKS for doing all the coding work to test this idea!!!! :)
Gary
>
> This was using 30 extremes, and using Graham's p(w) (complete with hambias
> 2, minprob .01 and maxprob .99). The f-n rate was more than 10x worse using
> f(w) with a=x=0.5, and I have no idea why yet (and we're *generally* having
> problems with f-n rates on smaller training sets when using f(w), whether
> using the central-limit scoring, or Gary's previous scoring; perhaps 'a'
> needs to be much smaller than 0.5 -- there's too much to test here).
>
> Here are the aggregate scaled R values (clamped to [-20, 20], and then
> scaled linearly into [0, 1]):
>
> Ham distribution for all runs:
> 5000 items; mean 0.35; sample sdev 3.62
> * = 82 items
> 0.00 4907 ************************************************************
> 2.50 13 *
> 5.00 14 *
> 7.50 12 *
> 10.00 4 *
> 12.50 11 *
> 15.00 6 *
> 17.50 7 *
> 20.00 4 *
> 22.50 2 *
> 25.00 2 *
> 27.50 3 *
> 30.00 0
> 32.50 1 *
> 35.00 3 *
> 37.50 0
> 40.00 1 *
> 42.50 1 *
> 45.00 1 *
> 47.50 1 *
> 50.00 1 *
> 52.50 0
> 55.00 1 *
> 57.50 0
> 60.00 0
> 62.50 1 *
> 65.00 2 *
> 67.50 0
> 70.00 0
> 72.50 0
> 75.00 0
> 77.50 0
> 80.00 0
> 82.50 0
> 85.00 0
> 87.50 0
> 90.00 0
> 92.50 0
> 95.00 0
> 97.50 2 *
>
> Spam distribution for all runs:
> 5000 items; mean 98.97; sample sdev 5.87
> * = 80 items
> 0.00 0
> 2.50 0
> 5.00 0
> 7.50 0
> 10.00 0
> 12.50 0
> 15.00 0
> 17.50 0
> 20.00 1 *
> 22.50 0
> 25.00 0
> 27.50 3 *
> 30.00 1 *
> 32.50 1 *
> 35.00 1 *
> 37.50 1 *
> 40.00 1 *
> 42.50 0
> 45.00 1 *
> 47.50 3 *
> 50.00 6 *
> 52.50 5 *
> 55.00 5 *
> 57.50 3 *
> 60.00 7 *
> 62.50 6 *
> 65.00 9 *
> 67.50 11 *
> 70.00 19 *
> 72.50 11 *
> 75.00 8 *
> 77.50 10 *
> 80.00 13 *
> 82.50 9 *
> 85.00 11 *
> 87.50 16 *
> 90.00 31 *
> 92.50 8 *
> 95.00 15 *
> 97.50 4784 ************************************************************
>
> All of the false positives had significant numbers of both 0.01 and 0.99
> clues. This seems to be a reappearance of the p(w) "cancellation disease"
> that we wormed around before by adding gobs of special-case code to Graham
> scoring. Several of the f-ns also had this problem. The outcome is like
> flipping a coin when this happens. Note that f(w) doesn't have this problem
> (it's mostly an artifact of the fact that p(w) artificially clamps
> probabilities, so many words end up with probs at the extreme values).
>
> Here are the means and variances of the training data scaled R values:
>
> hammean 0.0315194110198 hamvar 0.0102392908745
> spammean 0.977596060549 spamvar 0.00629493144389
>
> hammean 0.0289322128628 hamvar 0.00860263576484
> spammean 0.976784463455 spamvar 0.00703754635535
>
> hammean 0.0292168061706 hamvar 0.00850922282341
> spammean 0.977330456163 spamvar 0.00656386045426
>
> hammean 0.0292418489626 hamvar 0.00869327431102
> spammean 0.972324957985 spamvar 0.00968199258783
>
> hammean 0.0266295579103 hamvar 0.00745682458391
> spammean 0.974432833096 spamvar 0.00865812701944
>
>
> Finally, here's the same thing (including exactly the same messages in the
> training and prediction sets) all over again, *except* using f(w) with a=0.1
> and x=0.5 (I mentioned a=0.5 above; I lowered it again for this run, and
> that did help the f-n rate, but not much):
>
> 0.000 3.700
> 0 new false positives
> 37 new false negatives
>
> 0.000 2.500
> 0 new false positives
> 25 new false negatives
>
> 0.000 4.800
> 0 new false positives
> 48 new false negatives
>
> 0.100 2.900
> 1 new false positives
> 29 new false negatives
>
> 0.100 3.400
> 1 new false positives
> 34 new false negatives
>
> total unique false pos 2
> total unique false neg 173
> average fp % 0.04
> average fn % 3.46
>
> Ham distribution for all runs:
> 5000 items; mean 0.05; sample sdev 1.57
> * = 84 items
> 0.00 4991 ************************************************************
> 2.50 2 *
> 5.00 0
> 7.50 1 *
> 10.00 0
> 12.50 0
> 15.00 0
> 17.50 0
> 20.00 1 *
> 22.50 1 *
> 25.00 1 *
> 27.50 1 *
> 30.00 0
> 32.50 0
> 35.00 0
> 37.50 0
> 40.00 0
> 42.50 0
> 45.00 0
> 47.50 0
> 50.00 0
> 52.50 0
> 55.00 0
> 57.50 1 *
> 60.00 0
> 62.50 0
> 65.00 0
> 67.50 0
> 70.00 0
> 72.50 0
> 75.00 0
> 77.50 1 *
> 80.00 0
> 82.50 0
> 85.00 0
> 87.50 0
> 90.00 0
> 92.50 0
> 95.00 0
> 97.50 0
>
> Spam distribution for all runs:
> 5000 items; mean 94.82; sample sdev 15.16
> * = 69 items
> 0.00 5 *
> 2.50 2 *
> 5.00 1 *
> 7.50 5 *
> 10.00 8 *
> 12.50 6 *
> 15.00 10 *
> 17.50 8 *
> 20.00 6 *
> 22.50 14 *
> 25.00 10 *
> 27.50 11 *
> 30.00 9 *
> 32.50 4 *
> 35.00 12 *
> 37.50 12 *
> 40.00 11 *
> 42.50 8 *
> 45.00 16 *
> 47.50 15 *
> 50.00 21 *
> 52.50 21 *
> 55.00 20 *
> 57.50 27 *
> 60.00 21 *
> 62.50 19 *
> 65.00 31 *
> 67.50 18 *
> 70.00 18 *
> 72.50 22 *
> 75.00 22 *
> 77.50 37 *
> 80.00 36 *
> 82.50 50 *
> 85.00 59 *
> 87.50 50 *
> 90.00 67 *
> 92.50 74 **
> 95.00 76 **
> 97.50 4138 ************************************************************
>
> Too bizarre for me -- there may be a gross bug here, but the central-limit
> code is exactly the same in both cases, and the f(w) code is exactly the
> same as I've been using with good results for a few days.
>
> hammean 0.083575863227 hamvar 0.0384359627039
> spammean 0.978952388668 spamvar 0.00372443106446
>
> hammean 0.075515986459 hamvar 0.0331126798862
> spammean 0.97742506528 spamvar 0.00420869219347
>
> hammean 0.0776481612081 hamvar 0.0338362239317
> spammean 0.978462136207 spamvar 0.00341505065601
>
> hammean 0.07972071882 hamvar 0.0355833143776
> spammean 0.974508870296 spamvar 0.00530574730638
>
> hammean 0.0713015705881 hamvar 0.0303987810665
> spammean 0.976734052416 spamvar 0.00468795224229
>
> For whatever reason(s), I note that the ham variances are higher here than
> when using p(w), and the spam variances lower. Perhaps that's just due to
> the fact that f(w) doesn't have an artificial ham bias. OTOH, the
> prediction set ham distribution is much tighter when using the unbiased
> f(w), while the spam distribution is much looser.
>