[Spambayes] RE: Central Limit Theorem??!! :)

Tim Peters tim.one@comcast.net
Sun, 29 Sep 2002 12:07:18 -0400


[Tim]
> This is a log-central-limit experiment with a 10x10 grid.  500 ham
> and 500 spam selected at random from each of my sets, lumped into 10
> pairs.  Then
>
> train on pair 1, predict on pairs 2 thru 10
> train on pair 2, predict on pairs 1, and 3 thru 10
> ...
> train on pair 10, predict on pairs 1 thru 9
>
> In all, that's 90 prediction runs on 1000 msgs per run, for 90,000 total
> predictions.  Each of the 10*1000 = 10,000 unique msgs is predicted 9
> times.
>
> This is a hard test ...
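
For anyone checking the arithmetic, here is a minimal sketch of that train/predict schedule. The names `grid_runs`, `NUM_PAIRS` and `MSGS_PER_PAIR` are made up for illustration and are not part of the test driver; the sketch only enumerates which pair trains and which pairs get predicted.

```python
NUM_PAIRS = 10        # ten ham/spam pairs
MSGS_PER_PAIR = 1000  # 500 ham + 500 spam in each pair


def grid_runs(num_pairs=NUM_PAIRS):
    """Yield (train_pair, predict_pair) for every ordered pair with train != predict."""
    for train in range(1, num_pairs + 1):
        for predict in range(1, num_pairs + 1):
            if predict != train:
                yield train, predict


runs = list(grid_runs())
total_predictions = len(runs) * MSGS_PER_PAIR
print(len(runs), "runs,", total_predictions, "predictions")
```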

Then I made this up as a measure of "certainty":

> ...
>         if min(abs(zham), abs(zspam)) < 10.0 and abs(stat) > 5.0:
>             stat = stat > 0.0 and 1.0 or 0.0
>         else:
>             if stat > 0.0:
>                 stat = 0.51
>             else:
>                 stat = 0.49
>
> IOW, it's certain iff at least one z-score is "not insanely
> large", and the other z-score is "substantially larger".  This is
> purely a hack just to see what would happen.
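
The quoted fragment depends on names computed in the elided part. A hedged, self-contained restatement follows, taking `zham`, `zspam` and `stat` as inputs; the function name `collapse_score` is hypothetical, and how `stat` is derived is not shown here, so it is just passed in.

```python
def collapse_score(zham, zspam, stat):
    """Map the raw statistic to 1.0/0.0 when "certain", else to 0.51/0.49."""
    if min(abs(zham), abs(zspam)) < 10.0 and abs(stat) > 5.0:
        # Certain: one z-score is not insanely large and stat is far from 0.
        return 1.0 if stat > 0.0 else 0.0
    # Unsure: keep only the direction, just either side of 0.5.
    return 0.51 if stat > 0.0 else 0.49
```

The conditional expression replaces the `and ... or ...` idiom; the two give the same result here because 1.0 is a true value.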

On a different run than I reported on, here's the bottom line:

for all ham
    45000 total
    certain    44020 97.822%
        wrong      0  0.000%
    unsure       980  2.178%
        wrong     37  3.776%

for all spam
    45000 total
    certain    44151 98.113%
        wrong      8  0.018%
    unsure       849  1.887%
        wrong     76  8.952%
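
A note on reading these reports: the "certain" and "unsure" percentages are relative to the 45000 total, while each "wrong" percentage is relative to the bucket it sits under. A small sketch that recomputes the percentage column from the counts above (the helper name `bottom_line` and the dict keys are invented for this example):

```python
def bottom_line(total, certain, certain_wrong, unsure_wrong):
    """Recompute the percentage column of one report block."""
    unsure = total - certain

    def pct(part, whole):
        # Guard against an empty bucket so the helper never divides by zero.
        return 100.0 * part / whole if whole else 0.0

    return {
        "certain": pct(certain, total),
        "certain_wrong": pct(certain_wrong, certain),
        "unsure": pct(unsure, total),
        "unsure_wrong": pct(unsure_wrong, unsure),
    }


ham = bottom_line(45000, 44020, 0, 37)
spam = bottom_line(45000, 44151, 8, 76)
for name, report in (("ham", ham), ("spam", spam)):
    print(name, {key: "%.3f%%" % value for key, value in report.items()})
```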

Staring at the errors suggested a different measure of certainty:

        zham = abs(zham)
        zspam = abs(zspam)
        ratio = max(zham, zspam) / min(zham, zspam)
        certain = ratio > 3.0 or (n > 30 and ratio > 2.0)

n is the number of extreme words found in the msg.  If it doesn't exceed 30,
appeal to the central limit theorem is dubious.  So, in that case, I require
a larger ratio.  On the same data, the bottom line changes to:

for all ham
    45000 total
    certain    44698 99.329%
        wrong      0  0.000%
    unsure       302  0.671%
        wrong     37 12.252%

for all spam
    45000 total
    certain    44166 98.147%
        wrong      0  0.000%
    unsure       834  1.853%
        wrong     84 10.072%

IOW, it's certain more often (much more often for ham!), never made a
mistake when it was certain, and made mistakes at higher rates when it
wasn't certain.
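
The ratio test can be wrapped up as a function for experimenting. This is a sketch, not the code that produced the numbers above: the name `is_certain` is made up, and the handling of a zero z-score is an assumption added here, since the fragment as written would raise ZeroDivisionError when the smaller z-score is exactly 0.

```python
def is_certain(zham, zspam, n):
    """Ratio-of-z-scores certainty test; n is the number of extreme words."""
    zham = abs(zham)
    zspam = abs(zspam)
    larger = max(zham, zspam)
    smaller = min(zham, zspam)
    if smaller == 0.0:
        # Assumption (not in the original fragment): a zero denominator means
        # an unbounded ratio, unless both z-scores are zero, which is unsure.
        ratio = float("inf") if larger > 0.0 else 1.0
    else:
        ratio = larger / smaller
    # With few extreme words the central limit theorem is a shaky guide,
    # so demand a bigger ratio unless n exceeds 30.
    return ratio > 3.0 or (n > 30 and ratio > 2.0)


print(is_certain(1.0, 2.5, 10))  # ratio 2.5 with n = 10
print(is_certain(1.0, 2.5, 31))  # same ratio with n = 31
```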

The lesson I take is that somebody who actually knows what they're doing
could get something mondo useful out of this.

I noted that adding another clause

     ... or (n > 40 and ratio > 1.75)

changed the bottom line to

for all ham
    45000 total
    certain    44731 99.402%
        wrong      0  0.000%
    unsure       269  0.598%
        wrong     37 13.755%

for all spam
    45000 total
    certain    44258 98.351%
        wrong      0  0.000%
    unsure       742  1.649%
        wrong     84 11.321%
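
With three clauses the test is really a small table of (n threshold, ratio threshold) pairs, which makes it easy to try more. A hedged sketch, assuming the ratio has already been computed as above; `RULES` and `is_certain_rules` are names invented for this example.

```python
# Each rule reads: certain if n > min_n and ratio > min_ratio.
# The first rule uses -1 because n is a count, so n > -1 always holds.
RULES = [
    (-1, 3.0),
    (30, 2.0),
    (40, 1.75),
]


def is_certain_rules(ratio, n, rules=RULES):
    """Return True if any (min_n, min_ratio) rule is satisfied."""
    return any(n > min_n and ratio > min_ratio for min_n, min_ratio in rules)


print(is_certain_rules(1.8, 41))  # only the third clause can fire here
print(is_certain_rules(1.8, 35))  # no clause fires
```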

I've no idea how much useful info I'm leaving untouched.