[Spambayes] RE: Central Limit Theorem??!! :)

Gary Robinson grobinson@transpose.com
Sun, 29 Sep 2002 14:35:40 -0700


We know that the choice of extreme words in any given spam is not
independent. An email that has some spammy words is likely to have more of
them, and an email that has some hammy words is likely to have more of them.

Either way, it will pull the email to one side or another of the mean.

SO, I *think* it is very arguable that this correlation alone is enough to
explain the effect you are observing, and that it is not a problem for our
purposes.
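A toy simulation (entirely made-up numbers, not the real classifier) illustrates the point: if the n extreme-word scores within a message share a common per-message shift, the variance of their mean is much larger than popvar/n, so z-scores computed under the independence assumption come out inflated.

```python
import math
import random

random.seed(42)
N_MSGS, N_WORDS = 2000, 50   # N_WORDS plays the role of max_discriminators

def simulate(correlated):
    """Return the fraction of messages with |z| <= 1 when z is computed
    as if the N_WORDS scores in a message were independent draws."""
    msgs = []
    for _ in range(N_MSGS):
        # A per-message shift models correlation: spammy words travel together.
        shift = random.gauss(0, 1) if correlated else 0.0
        msgs.append([shift + random.gauss(0, 1) for _ in range(N_WORDS)])
    # Population stats over *individual* word scores, pooled across messages.
    allscores = [s for m in msgs for s in m]
    popmean = sum(allscores) / len(allscores)
    popvar = sum((s - popmean) ** 2 for s in allscores) / len(allscores)
    # z-score assuming the scores within a message are independent.
    zs = [(sum(m) / N_WORDS - popmean) / math.sqrt(popvar / N_WORDS)
          for m in msgs]
    return sum(abs(z) <= 1.0 for z in zs) / N_MSGS

print("independent:", simulate(False))  # close to the normal 68%
print("correlated: ", simulate(True))   # far smaller: z is inflated
```

With correlated scores the spread of the per-message mean is dominated by the shared shift, so far fewer than 68% of messages land within |z| <= 1, which is exactly the flattened-looking distribution Tim reports below.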


--Gary


-- 
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454


> From: Tim Peters <tim.one@comcast.net>
> Date: Sun, 29 Sep 2002 15:55:49 -0400
> To: Gary Robinson <grobinson@transpose.com>
> Cc: SpamBayes <spambayes@python.org>, Greg Louis <glouis@dynamicro.on.ca>
> Subject: RE: [Spambayes] RE: Central Limit Theorem??!! :)
> 
> Something odd I noticed in my log central-limit code:  the z-scores looked
> "too big", but maybe I'm confusing myself.
> 
> I was running with max_discriminators=50.  Of the 45,000 each of ham and
> spam predicted against, 27,094 ham had at least 50 extreme words, and 36,491
> spam had at least 50 extreme words.  In what follows I'm looking solely at
> those, and 50 should be plenty big enough for the theorem to kick in.
> 
> What I *expected* is that, in
> 
>       zham  = (hmean - self.hammean) / sqrt(self.hamvar / n)
>       zspam = (smean - self.spammean) / sqrt(self.spamvar / n)
> 
> where
>   n is the number of extreme words (always 50 here)
>   hmean is the mean of the n message ln(1-p) values
>   smean is the mean of the n message ln(p) values
>   hammean is the ham training population mean ln(1-p)
>   hamvar  is the ham training population ln(1-p) variance
>   spammean is the spam training population mean ln(p)
>   spamvar  is the spam training population ln(p) variance
> 
> that zham would be approximately normally distributed (mean 0, sdev 1) when
> predicting a ham, and zspam similarly when predicting a spam.
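The quoted formulas can be sketched end to end in a few lines. The word probabilities below (`train_p`, `msg_p`) are invented for illustration; only the z-score arithmetic mirrors the quoted code.

```python
import math

def mean_var(xs):
    """Population mean and variance (divide by N, not N-1)."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

# Hypothetical training-population word probabilities p.
train_p = [0.01, 0.05, 0.2, 0.5, 0.8, 0.95, 0.99]
hammean, hamvar = mean_var([math.log(1 - p) for p in train_p])
spammean, spamvar = mean_var([math.log(p) for p in train_p])

# Hypothetical message: the probabilities of its n extreme words
# (hammy ones, so p is small).
msg_p = [0.02, 0.1, 0.3]
n = len(msg_p)
hmean = sum(math.log(1 - p) for p in msg_p) / n
smean = sum(math.log(p) for p in msg_p) / n

zham = (hmean - hammean) / math.sqrt(hamvar / n)
zspam = (smean - spammean) / math.sqrt(spamvar / n)
print("zham =", zham)    # positive: hammier than the mixed population
print("zspam =", zspam)  # negative: less spammy than the mixed population
```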
> 
> But it's nowhere near that:
> 
> When predicting a ham,
> 
> This % of hams   had abs(zham) <= this
> --------------   ---------------------
> 18.377%           1.0
> 36.525%           2.0
> 53.650%           3.0
> 67.919%           4.0
> 78.301%           5.0
> 85.831%           6.0
> 90.788%           7.0
> 93.762%           8.0
> 95.696%           9.0
> 97.044%          10.0
> 
> and when predicting a spam,
> 
> This % of spams  had abs(zspam) <= this
> ---------------  ----------------------
> 20.337%           1.0
> 44.120%           2.0
> 69.732%           3.0
> 89.559%           4.0
> 92.702%           5.0
> 94.656%           6.0
> 96.191%           7.0
> 97.205%           8.0
> 97.964%           9.0
> 98.539%          10.0
> 
> I *expected* that about 68% of messages would have abs(z) <= 1.0, and about
> 95% <= 2.0, etc.  What I'm seeing is more like a linear relationship!
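For reference, those expected percentages are just the standard normal CDF: for z ~ N(0, 1), P(|z| <= k) = erf(k / sqrt(2)). A quick check of what the tables above *should* have looked like:

```python
import math

# Expected % of messages with |z| <= k under a standard normal.
for k in range(1, 6):
    pct = math.erf(k / math.sqrt(2)) * 100
    print("%d  %.3f%%" % (k, pct))
# 1  68.269%
# 2  95.450%
# 3  99.730%
```

Against these, the observed ~18-20% at |z| <= 1 is wildly off, consistent with the inflated-variance explanation above.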
> 
> So I'm confused, or something's very fishy here.  I've stared at the code
> until my eyes bled, but it still looks like it's doing everything right, and
> I'm still using scaled unbounded integers to compute the population means
> and variances so there's no possibility of roundoff error polluting the
> results.
> 
> Got a clue?  It's very curious that the 68% and 95% I was looking for show
> up at 4 and 8-9 in the ham results, and at 3 and 6-7 in the spam results.
>