[Spambayes] RE: Central Limit Theorem??!! :)

Sun, 29 Sep 2002 20:15:20 -0400

[Tim agonizes over the fact that his logarithmic z-scores aren't normal]

[Gary Robinson]
> We know that the choice of extreme words in any given spam is not
> independent. An email that has some spammy words is likely to have
> more of them, and an email that has some hammy words is likely to
> have more of them.

IOW, the sample I'm drawing isn't random, so the CLT doesn't really apply.
I can buy that.  Indeed, after the next point, I have to <wink>:

> Either way, it will pull the email to one side or another of the mean.

Turns out this isn't symmetric either.  Still restricted to the msgs with at
least 50 extreme words, only 10.5% of predicted hams had a positive zham,
but 61.3% of predicted spams had a positive zspam.  I guess that says it's
very much easier to be "very spammy" than it is to be "very hammy", and
that's consistent with other clues that we've seen.

If I look at all msgs (regardless of how many words they contain) from this
run, 9.1% of predicted hams had positive zham, and 51.5% of predicted spam
had positive zspam.

More-- and this may be useful <wink> --whenever the z-score with the smaller
magnitude was positive, the prediction was always correct.

> SO, I *think* it is very arguable that that is enough to explain
> the effect you are observing, and that it is not a problem for our
> purposes.

Well, it blows all hell out of the notion that z-scores can be converted to
probabilities in an obvious way.  It leaves us with "two numbers".  We've
been in worse spots than that <wink>.

I changed my best heuristic cheap stab at guessing certainty to:

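        certain = False
        if abs(zham) < abs(zspam):
            # zham has the smaller magnitude.
            if zham > 0:
                certain = True
            else:
                # zham <= 0 here; zham == 0 would divide by zero.
                ratio = zspam / zham
        else:
            # zspam has the smaller (or equal) magnitude.
            if zspam > 0:
                certain = True
            else:
                # zspam <= 0 here; zspam == 0 would divide by zero.
                ratio = zham / zspam

        if not certain:
            # Larger-magnitude z-score over smaller-magnitude z-score.
            ratio = abs(ratio)
            # The more extreme words (n), the smaller the ratio we accept.
            certain = (ratio > 3.0 or
                       (n > 30 and ratio > 2.0) or
                       (n > 40 and ratio > 1.75))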
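
For experimenting with the cutoffs, the fragment above can be wrapped in a
function.  This is only a sketch under stated assumptions: the name
guess_certainty, the first_cutoff parameter, and the returned "ham"/"spam"
label are hypothetical, the prediction is assumed to be the class whose
z-score has the smaller magnitude, and the handling of a zero z-score is an
added guard that the fragment does not have.

```python
def guess_certainty(zham, zspam, n, first_cutoff=3.0):
    """Return (prediction, certain) for one message.

    Hypothetical wrapper around the posted heuristic.  zham and zspam are
    the two z-scores and n is the number of extreme words.
    """
    # Assumption: the predicted class is the one whose z-score has the
    # smaller magnitude ("near"); the other z-score is "far".
    if abs(zham) < abs(zspam):
        prediction, near, far = "ham", zham, zspam
    else:
        prediction, near, far = "spam", zspam, zham

    # A positive smaller-magnitude z-score is treated as certain outright.
    if near > 0:
        return prediction, True

    # Added guard (not in the posted fragment): avoid dividing by zero.
    if near == 0:
        ratio = float("inf") if far != 0 else 0.0
    else:
        ratio = abs(far / near)

    certain = (ratio > first_cutoff or
               (n > 30 and ratio > 2.0) or
               (n > 40 and ratio > 1.75))
    return prediction, certain


if __name__ == "__main__":
    # Made-up inputs, purely for illustration.
    print(guess_certainty(1.2, -20.0, 10))
    print(guess_certainty(-5.0, -11.0, 35))
```

Making the first cutoff a parameter lets the same function be rerun at 3.0,
2.6 and 2.5 without editing the body.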

It turns out that gave exactly the same results as before I noticed the
"whenever the z-score with the smaller magnitude was positive, the
prediction was always correct" bit:

for all ham
    45000 total
    certain    44731 99.402%
        wrong      0  0.000%
    unsure       269  0.598%
        wrong     37 13.755%

for all spam
    45000 total
    certain    44258 98.351%
        wrong      0  0.000%
    unsure       742  1.649%
        wrong     84 11.321%

The first cutoff can be reduced from 3 to 2.6 without making an error on a
"certain" one:

for all ham
    45000 total
    certain    44764 99.476%
        wrong      0  0.000%
    unsure       236  0.524%
        wrong     37 15.678%

for all spam
    45000 total
    certain    44371 98.602%
        wrong      0  0.000%
    unsure       629  1.398%
        wrong     84 13.355%

but if reduced to 2.5 it starts to screw up:

for all ham
    45000 total
    certain    44774 99.498%
        wrong      0  0.000%
    unsure       226  0.502%
        wrong     37 16.372%

for all spam
    45000 total
    certain    44395 98.656%
        wrong      1  0.002%
    unsure       605  1.344%
        wrong     83 13.719%
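
The percentages in these reports can be reproduced from the raw counts.  The
sketch below assumes that each "wrong" percentage is taken relative to its
own group (certain or unsure) rather than to the total; the unsure figures
only fit that reading, while the single certain-but-wrong case rounds to the
same value either way.  The function name report_percentages is
hypothetical.

```python
def report_percentages(total, certain, certain_wrong, unsure_wrong):
    """Recompute the four percentages of one report block as strings."""
    unsure = total - certain

    def pct(part, whole):
        # Three decimal places, as in the reports; guard against empty groups.
        return "%.3f" % (100.0 * part / whole if whole else 0.0)

    return {
        "certain": pct(certain, total),
        "certain_wrong": pct(certain_wrong, certain),
        "unsure": pct(unsure, total),
        "unsure_wrong": pct(unsure_wrong, unsure),
    }


if __name__ == "__main__":
    # Counts from the "for all spam" block at cutoff 2.5.
    print(report_percentages(45000, 44395, 1, 83))
```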

In the "certain but wrong" case, the heuristic was certain a spam was ham at
cutoff 2.5:

    n = 21
    zham  =  -6.73
    zspam = -17.12
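
Working the numbers for this message shows why it flips between cutoffs 2.6
and 2.5: zham has the smaller magnitude and is not positive, so the ratio is
|zspam / zham|, about 2.54, and with n = 21 neither the n > 30 nor the
n > 40 clause applies.  A quick check, using only the three values above:

```python
n = 21
zham = -6.73
zspam = -17.12

# zham has the smaller magnitude and is not positive, so the ratio is
# the larger-magnitude z-score divided by the smaller-magnitude one.
assert abs(zham) < abs(zspam) and zham <= 0
ratio = abs(zspam / zham)
print("ratio = %.3f" % ratio)

for cutoff in (3.0, 2.6, 2.5):
    certain = (ratio > cutoff or
               (n > 30 and ratio > 2.0) or
               (n > 40 and ratio > 1.75))
    print("cutoff %.1f -> certain = %s" % (cutoff, certain))
```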

It's hard to generalize from one example, though <wink>.