[Spambayes] z

Gary Robinson grobinson@transpose.com
Tue Oct 15 21:50:43 2002


Urgh. Sorry. I am so totally swamped with work that I am only looking in
quickly now and then, and I think I got the wrong impression earlier.

Based on what you say in the message quoted below, I think you're already
doing what I was hoping for, with the exception of the ranking part! I guess
I was confused by the earlier message...

And I also agree that it doesn't make sense to try ranking now, because there
are aspects of this data that mean it won't come out to a uniform
distribution under a reasonable null hypothesis without more tweaking than I
(or, I guess, any of us) can suggest a way to do at this point.

--Gary


-- 
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454


> From: Tim Peters <tim.one@comcast.net>
> Date: Tue, 15 Oct 2002 16:05:33 -0400
> To: Gary Robinson <grobinson@transpose.com>
> Cc: SpamBayes <spambayes@python.org>
> Subject: RE: [Spambayes] z
> 
> [Tim]
>>> If Rob is feeling particularly adventurous, it would be interesting (in
>>> connection with z-combining) to transform the database spamprobs into
>>> unit-normalized zscores via his RMS black magic, as an extra
>>> step at the end of update_probabilities().  This wouldn't require another
> 
> [Gary Robinson]
>> I didn't realize that this wasn't already being done.
> 
> It's unclear to me what "this" means.  RMS transformations?  No, we're not
> doing those here.
> 
>> Yes I would recommend that somebody do this because I don't think we're
>> really testing the z approach completely fairly until it is.
> 
> You tell me whether this is this <wink>; this is the code people have been
> using:
> 
>   def z_spamprob(self, wordstream, evidence=False):
>       from math import sqrt
> 
>       clues = self._getclues(wordstream)
>       zsum = 0.0
>       for prob, word, record in clues:
>           if record is not None:  # else wordinfo doesn't know about it
>               record.killcount += 1
>           zsum += normIP(prob)
> 
>       n = len(clues)
>       if n:
>           # We've added n zscores from a unit normal distribution.  By the
>           # central limit theorem, their mean is normally distributed with
>           # mean 0 and sdev 1/sqrt(n).  So the zscore of zsum/n is
>           # (zsum/n - 0)/(1/sqrt(n)) = zsum/n/(1/sqrt(n)) = zsum/sqrt(n).
>           prob = normP(zsum / sqrt(n))
>       else:
>           prob = 0.5
> 
> normIP() maps a probability p to the real z such that the area under the
> unit Gaussian from -inf to z is p.  normP() is the inverse, mapping real z
> to the area under the unit Gaussian from -inf to z.  Example:
> 
> >>> normIP(.9)
> 1.2815502653713151
> >>> normP(_)
> 0.8999997718215671
> >>> normIP(.1)
> -1.2815502653713149
> >>> normP(_)
> 0.10000022817843296
> 
> normP() is accurate to about 14 decimal digits; normIP() is accurate to
> about 6 decimal digits.
> 
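The normIP()/normP() helpers themselves aren't shown in the message. For
anyone wanting to experiment, a minimal stand-in using only Python's standard
library could look like the sketch below; this is not the implementation
actually used in spambayes, and its inverse happens to be more accurate than
the roughly 6-digit normIP() quoted above.

    from math import erf, sqrt
    from statistics import NormalDist  # Python 3.8+

    def normP(z):
        # Area under the unit Gaussian from -inf to z (standard normal CDF).
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))

    def normIP(p):
        # Inverse of normP: the z whose lower-tail area is p (0 < p < 1).
        return NormalDist().inv_cdf(p)
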
> The word "prob" values here are your f(w).
> 
>> I'm not saying I believe that the z approach will turn out to be
>> better -- I just don't know -- but it seems worth trying.
> 
> Happy to try, but I really don't know how to proceed.  There seems to be no reason
> to believe that the f(w) values lead to normIP() values that are *in fact*
> unit-normal distributed on a random collection of words, and I don't
> actually see a reason to believe that this would get closer to being true if
> the f(w) were ranked first.
> 
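For concreteness, one possible reading of "ranked first" (an assumption here;
the thread doesn't spell out the transform) is to replace each database
spamprob by its rank before applying normIP(), i.e. a rank-based
inverse-normal transform run as an extra step at the end of
update_probabilities(). A rough sketch:

    from statistics import NormalDist

    def rank_to_zscores(spamprobs):
        # spamprobs: the f(w) values for every word in the database.
        # Each value is replaced by normIP((rank - 0.5) / N), so the
        # transformed scores are the quantiles of a unit normal no matter
        # how the raw f(w) values are distributed.  Ties are ignored here.
        inv_cdf = NormalDist().inv_cdf
        n = len(spamprobs)
        order = sorted(range(n), key=spamprobs.__getitem__)
        zscores = [0.0] * n
        for rank, i in enumerate(order, 1):
            zscores[i] = inv_cdf((rank - 0.5) / n)
        return zscores
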
> If we can define precisely what we mean by "a random collection of words",
> the idea that the resulting normIP() values are or aren't unit-normal
> distributed seems easily testable, though.
>
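
Tim's closing point is straightforward to prototype once "a random collection
of words" is pinned down. As one illustration (not from the thread): sample
spamprobs from the database, map them through normIP(), and compare the
result against a unit normal, e.g. with a Kolmogorov-Smirnov test. The
sampling scheme below is a placeholder, and scipy is assumed to be available.

    import random
    from statistics import NormalDist
    from scipy import stats

    def zscores_look_unit_normal(spamprobs, sample_size=1000):
        # spamprobs: f(w) values from the database, assumed to lie strictly
        # between 0.0 and 1.0 so inv_cdf is defined.  Uniform random sampling
        # is just one stand-in for "a random collection of words".
        inv_cdf = NormalDist().inv_cdf
        sample = random.sample(spamprobs, min(sample_size, len(spamprobs)))
        zscores = [inv_cdf(p) for p in sample]
        # KS test against the standard normal; a tiny p-value says the
        # z-scores are clearly not unit-normal distributed.
        statistic, pvalue = stats.kstest(zscores, 'norm')
        return statistic, pvalue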