[Spambayes] RE: Central Limit Theorem??!! :)

Gary Robinson grobinson@transpose.com
Sun, 29 Sep 2002 17:27:02 -0700


> Well, it blows all hell out of the notion that z-scores can be converted to
> probabilities in an obvious way.  It leaves us with "two numbers".  We've
> been in worse spots than that <wink>.


I *think* that depends on what probability you're talking about. That is, if
the null hypothesis is that a given email is a random collection of words,
then I *think* the z-score DOES correspond to the probability that such an
extreme random collection of words (extreme in the spammy or hammy
direction), or a more extreme one, would have happened by chance alone.

So in that sense, it seems to me that it probably can be useful as a
rigorous probability, if we remain conscious of what null hypothesis we are
operating against. 
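
To make that concrete: under that null hypothesis the z-score is
(approximately) standard normal, so the one-sided tail area is the chance of
an equally or more extreme "random collection of words". A tiny sketch, just
for illustration (the function name isn't anything in the classifier code):

    from math import erfc, sqrt

    def tail_probability(z):
        # Chance that a standard-normal score lands at least this far out
        # in the same direction; i.e. the probability, under the null
        # hypothesis, of an equally or more extreme random collection of
        # words arising by chance alone.
        return 0.5 * erfc(abs(z) / sqrt(2))

    # e.g. tail_probability(1.96) is about 0.025, tail_probability(3.0)
    # is about 0.00135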


--Gary


-- 
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454


> From: Tim Peters <tim.one@comcast.net>
> Date: Sun, 29 Sep 2002 20:15:20 -0400
> To: Gary Robinson <grobinson@transpose.com>
> Cc: SpamBayes <spambayes@python.org>, Greg Louis <glouis@dynamicro.on.ca>
> Subject: RE: [Spambayes] RE: Central Limit Theorem??!!     :)
> 
> [Tim, agonizing that his logarithmic z-scores aren't normal]
> 
> [Gary Robinson]
>> We know that the choice of extreme words in any given spam is not
>> independent. An email that has some spammy words is likely to have
>> more of them, and an email that has some hammy words is likely to
>> have more of them.
> 
> IOW, the sample I'm drawing isn't random, so the clt doesn't really apply.
> I can buy that.  Indeed, after the next point, I have to <wink>:
> 
>> Either way, it will pull the email to one side or another of the mean.
> 
> Turns out this isn't symmetric either.  Still restricted to the msgs with at
> least 50 extreme words, only 10.5% of predicted hams had a positive zham,
> but 61.3% of predicted spams had a positive zspam.  I guess that says it's
> very much easier to be "very spammy" than it is to be "very hammy", and
> that's consistent with other clues that we've seen.
> 
> If I look at all msgs (regardless of how many words they contain) from this
> run, 9.1% of predicted hams had positive zham, and 51.5% of predicted spam
> had positive zspam.
> 
> More-- and this may be useful <wink> --whenever the z-score with the smaller
> magnitude was positive, the prediction was always correct.
> 
>> SO, I *think* it is very arguable that that is enough to explain
>> the effect you are observing, and that it is not a problem for our
>> purposes.
> 
> Well, it blows all hell out of the notion that z-scores can be converted to
> probabilities in an obvious way.  It leaves us with "two numbers".  We've
> been in worse spots than that <wink>.
> 
> I changed my best cheap heuristic stab at guessing certainty to:
> 
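>       # zham and zspam are the ham and spam z-scores for the message;
>       # n is its number of extreme words.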
>       certain = False
>       if abs(zham) < abs(zspam):
>           if zham > 0:
>               certain = True
>           else:
>               ratio = zspam / zham
>       else:
>           if zspam > 0:
>               certain = True
>           else:
>               ratio = zham / zspam
> 
>       if not certain:
>           ratio = abs(ratio)
>           certain = (ratio > 3.0 or
>                      (n > 30 and ratio > 2.0) or
>                      (n > 40 and ratio > 1.75))
> 
> It turns out that gave exactly the same results as before I noticed the
> "whenever the z-score with the smaller magnitude was positive, the
> prediction was always correct" bit:
> 
> for all ham
>   45000 total
>   certain    44731 99.402%
>       wrong      0  0.000%
>   unsure       269  0.598%
>       wrong     37 13.755%
> 
> for all spam
>   45000 total
>   certain    44258 98.351%
>       wrong      0  0.000%
>   unsure       742  1.649%
>       wrong     84 11.321%
> 
> The first cutoff can be reduced from 3 to 2.6 without making an error on a
> "certain" one:
> 
> for all ham
>   45000 total
>   certain    44764 99.476%
>       wrong      0  0.000%
>   unsure       236  0.524%
>       wrong     37 15.678%
> 
> for all spam
>   45000 total
>   certain    44371 98.602%
>       wrong      0  0.000%
>   unsure       629  1.398%
>       wrong     84 13.355%
> 
> but if reduced to 2.5 it starts to screw up:
> 
> for all ham
>   45000 total
>   certain    44774 99.498%
>       wrong      0  0.000%
>   unsure       226  0.502%
>       wrong     37 16.372%
> 
> for all spam
>   45000 total
>   certain    44395 98.656%
>       wrong      1  0.002%
>   unsure       605  1.344%
>       wrong     83 13.719%
> 
> In the "certain but wrong" case, the heuristic was certain a spam was ham at
> cutoff 2.5:
> 
>   n = 21
>   zham  =  -6.73
>   zspam = -17.12
> 
> It's hard to generalize from one example, though <wink>.
>
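
For anyone who wants to play with the heuristic quoted above, here is one
self-contained reading of it as a function. Treat it as a sketch only: the
name and the 'certain'/'unsure' return values are made up for illustration,
and (like the original snippet) it doesn't guard against a z-score of
exactly zero.

    def judge(zham, zspam, n):
        # zham and zspam are the ham and spam z-scores for a message;
        # n is its number of extreme words.
        if abs(zham) < abs(zspam):
            if zham > 0:
                return 'certain'    # smaller-magnitude z-score is positive
            ratio = abs(zspam / zham)
        else:
            if zspam > 0:
                return 'certain'    # smaller-magnitude z-score is positive
            ratio = abs(zham / zspam)
        # Otherwise fall back to the ratio test, with the cutoff relaxed
        # as the number of extreme words grows.
        if ratio > 3.0 or (n > 30 and ratio > 2.0) or (n > 40 and ratio > 1.75):
            return 'certain'
        return 'unsure'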