[Spambayes] RE: Central Limit Theorem??!! :)
Gary Robinson
grobinson@transpose.com
Sun, 29 Sep 2002 17:27:02 -0700
> Well, it blows all hell out of the notion that z-scores can be converted to
> probabilities in an obvious way. It leaves us with "two numbers". We've
> been in worse spots than that <wink>.
I *think* that depends on what probability you're talking about. That is, if
the null hypothesis is that a given email is a random collection of words,
then I *think* it DOES correspond to the probability that such an extreme
random collection of words (extreme in the spammy or hammy direction), or a
more extreme one, would have happened by chance alone.
So in that sense, it seems to me that it probably can be useful as a
rigorous probability, if we remain conscious of what null hypothesis we are
operating against.
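For concreteness, the conversion from a z-score to that kind of one-sided
probability is just the tail area of the standard normal. A minimal sketch
(the function and its name are mine, not from the tokenizer code):

```python
import math

def one_sided_p(z):
    """P(Z >= z) for a standard normal Z: the probability of a
    collection of words at least this extreme in one direction,
    under the null hypothesis that the email is a random
    collection of words."""
    # Upper-tail area via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2))

# A z-score of about 1.645 leaves roughly 5% in the upper tail.
print(one_sided_p(1.645))
```

Of course, per Tim's data the null hypothesis is badly violated for real
email, so this number is only meaningful relative to that null.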
--Gary
--
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454
> From: Tim Peters <tim.one@comcast.net>
> Date: Sun, 29 Sep 2002 20:15:20 -0400
> To: Gary Robinson <grobinson@transpose.com>
> Cc: SpamBayes <spambayes@python.org>, Greg Louis <glouis@dynamicro.on.ca>
> Subject: RE: [Spambayes] RE: Central Limit Theorem??!! :)
>
> [Tim, agonizing over the fact that his logarithmic z-scores aren't normal]
>
> [Gary Robinson]
>> We know that the choice of extreme words in any given spam is not
>> independent. An email that has some spammy words is likely to have
>> more of them, and an email that has some hammy words is likely to
>> have more of them.
>
> IOW, the sample I'm drawing isn't random, so the clt doesn't really apply.
> I can buy that. Indeed, after the next point, I have to <wink>:
>
>> Either way, it will pull the email to one side or another of the mean.
>
> Turns out this isn't symmetric either. Still restricted to the msgs with at
> least 50 extreme words, only 10.5% of predicted hams had a positive zham,
> but 61.3% of predicted spams had a positive zspam. I guess that says it's
> very much easier to be "very spammy" than it is to be "very hammy", and
> that's consistent with other clues that we've seen.
>
> If I look at all msgs (regardless of how many words they contain) from this
> run, 9.1% of predicted hams had positive zham, and 51.5% of predicted spam
> had positive zspam.
>
> More-- and this may be useful <wink> --whenever the z-score with the smaller
> magnitude was positive, the prediction was always correct.
>
>> SO, I *think* it is very arguable that that is enough to explain
>> the effect you are observing, and that it is not a problem for our
>> purposes.
>
> Well, it blows all hell out of the notion that z-scores can be converted to
> probabilities in an obvious way. It leaves us with "two numbers". We've
> been in worse spots than that <wink>.
>
> I changed my best heuristic cheap stab at guessing certainty to:
>
>     certain = False
>     if abs(zham) < abs(zspam):
>         if zham > 0:
>             certain = True
>         else:
>             ratio = zspam / zham
>     else:
>         if zspam > 0:
>             certain = True
>         else:
>             ratio = zham / zspam
>
>     if not certain:
>         ratio = abs(ratio)
>         certain = (ratio > 3.0 or
>                    (n > 30 and ratio > 2.0) or
>                    (n > 40 and ratio > 1.75))
>
> It turns out that gave exactly the same results as before I noticed the
> "whenever the z-score with the smaller magnitude was positive, the
> prediction was always correct" bit:
>
> for all ham
>     45000 total
>     certain  44731  99.402%
>       wrong      0   0.000%
>     unsure     269   0.598%
>       wrong     37  13.755%
>
> for all spam
>     45000 total
>     certain  44258  98.351%
>       wrong      0   0.000%
>     unsure     742   1.649%
>       wrong     84  11.321%
>
> The first cutoff can be reduced from 3 to 2.6 without making an error on a
> "certain" one:
>
> for all ham
>     45000 total
>     certain  44764  99.476%
>       wrong      0   0.000%
>     unsure     236   0.524%
>       wrong     37  15.678%
>
> for all spam
>     45000 total
>     certain  44371  98.602%
>       wrong      0   0.000%
>     unsure     629   1.398%
>       wrong     84  13.355%
>
> but if reduced to 2.5 it starts to screw up:
>
> for all ham
>     45000 total
>     certain  44774  99.498%
>       wrong      0   0.000%
>     unsure     226   0.502%
>       wrong     37  16.372%
>
> for all spam
>     45000 total
>     certain  44395  98.656%
>       wrong      1   0.002%
>     unsure     605   1.344%
>       wrong     83  13.719%
>
> In the "certain but wrong" case, the heuristic was certain a spam was ham at
> cutoff 2.5:
>
>     n = 21
>     zham = -6.73
>     zspam = -17.12
>
> It's hard to generalize from one example, though <wink>.
>
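For reference, Tim's heuristic above can be restated as a self-contained
function (the wrapper and its name are my restatement; n is the number of
extreme words, zham/zspam the two z-scores, and the cutoffs are exactly as
quoted):

```python
def is_certain(n, zham, zspam):
    """Tim's certainty heuristic: certain if the smaller-magnitude
    z-score is positive, else certain when the z-score ratio clears
    an n-dependent cutoff."""
    certain = False
    ratio = 0.0
    if abs(zham) < abs(zspam):
        if zham > 0:
            certain = True          # smaller-magnitude score positive
        else:
            ratio = zspam / zham
    else:
        if zspam > 0:
            certain = True          # smaller-magnitude score positive
        else:
            ratio = zham / zspam

    if not certain:
        ratio = abs(ratio)
        certain = (ratio > 3.0 or
                   (n > 30 and ratio > 2.0) or
                   (n > 40 and ratio > 1.75))
    return certain
```

Note that on the "certain but wrong" example above (n=21, zham=-6.73,
zspam=-17.12) the ratio is about 2.54, so the heuristic stays unsure at
cutoff 3.0 (or 2.6) and only goes wrong once the cutoff drops to 2.5.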