[Spambayes] z
Gary Robinson
grobinson@transpose.com
Tue Oct 15 21:50:43 2002
Urgh. Sorry. I am so swamped with work that I am only able to look in
quickly now and then, and I think I got a wrong impression before.
Based on what you say in the message quoted below, I think you're already
doing what I was hoping for, with the exception of the ranking part! I guess
I was confused by the earlier message...
And I also agree that it doesn't make sense to try ranking now: there are
aspects of this data that mean it won't come out to a uniform distribution
under a reasonable null hypothesis without more tweaking than I (or, I
guess, any of us) can suggest at this point.
--Gary
--
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454
> From: Tim Peters <tim.one@comcast.net>
> Date: Tue, 15 Oct 2002 16:05:33 -0400
> To: Gary Robinson <grobinson@transpose.com>
> Cc: SpamBayes <spambayes@python.org>
> Subject: RE: [Spambayes] z
>
> [Tim]
>>> If Rob is feeling particularly adventurous, it would be interesting (in
>>> connection with z-combining) to transform the database spamprobs into
>>> unit-normalized zscores via his RMS black magic, as an extra
>>> step at the end of update_probabilities(). This wouldn't require another
>
> [Gary Robinson]
>> I didn't realize that this wasn't already being done.
>
> It's unclear to me what "this" means. RMS transformations? No, we're not
> doing those here.
>
>> Yes I would recommend that somebody do this because I don't think we're
>> really testing the z approach completely fairly until it is.
>
> You tell me whether this is this <wink>; this is the code people have been
> using:
>
> def z_spamprob(self, wordstream, evidence=False):
>     from math import sqrt
>
>     clues = self._getclues(wordstream)
>     zsum = 0.0
>     for prob, word, record in clues:
>         if record is not None:  # else wordinfo doesn't know about it
>             record.killcount += 1
>         zsum += normIP(prob)
>
>     n = len(clues)
>     if n:
>         # We've added n zscores from a unit normal distribution. By the
>         # central limit theorem, their mean is normally distributed with
>         # mean 0 and sdev 1/sqrt(n). So the zscore of zsum/n is
>         # (zsum/n - 0)/(1/sqrt(n)) = zsum/n/(1/sqrt(n)) = zsum/sqrt(n).
>         prob = normP(zsum / sqrt(n))
>     else:
>         prob = 0.5
>
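[Editor's note: the central-limit scaling in the comment above, where zsum/sqrt(n)
is treated as a zscore, can be checked numerically. In this sketch random.gauss()
stands in for the normIP(prob) terms; it is an illustration, not code from the list.]

```python
import random
from math import sqrt

random.seed(1)
n = 50
# Each trial: sum n draws from a unit normal (stand-ins for normIP(prob)),
# then rescale by sqrt(n) as the comment above prescribes.
trials = [sum(random.gauss(0.0, 1.0) for _ in range(n)) / sqrt(n)
          for _ in range(5000)]
mean = sum(trials) / len(trials)
var = sum((t - mean) ** 2 for t in trials) / (len(trials) - 1)
# If the scaling is right, the rescaled sums should again look unit-normal.
print(round(mean, 2), round(var, 2))
```

The sample mean comes out near 0 and the sample variance near 1, as the
comment in z_spamprob() predicts.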
> normIP() maps a probability p to the real z such that the area under the
> unit Gaussian from -inf to z is p. normP() is the inverse, mapping real z
> to the area under the unit Gaussian from -inf to z. Example:
>
> >>> normIP(.9)
> 1.2815502653713151
> >>> normP(_)
> 0.8999997718215671
> >>> normIP(.1)
> -1.2815502653713149
> >>> normP(_)
> 0.10000022817843296
> >>>
>
> normP() is accurate to about 14 decimal digits; normIP() is accurate to
> about 6 decimal digits.
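[Editor's note: under those definitions, normP() is just the unit-Gaussian CDF,
which can be sketched from math.erf; the bisection-based normIP() below is an
illustrative inverse, not the project's actual implementation.]

```python
from math import erf, sqrt

def normP(z):
    # Area under the unit Gaussian from -inf to z, via the error function:
    # Phi(z) = (1 + erf(z / sqrt(2))) / 2.
    return (1.0 + erf(z / sqrt(2.0))) / 2.0

def normIP(p):
    # Inverse of normP: find z with normP(z) == p by bisection.
    # Plenty of accuracy for illustration; not the fastest method.
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = (lo + hi) / 2.0
        if normP(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

With these, normIP(.9) gives about 1.28155 and normP() inverts it back to
0.9, matching the session transcript above to the stated accuracy.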
>
> The word "prob" values here are your f(w).
>
>> I'm not saying I believe that the z approach will turn out to be
>> better -- I just don't know -- but it seems worth trying.
>
> Happy to try, but really don't know how to proceed. There seems no reason
> to believe that the f(w) values lead to normIP() values that are *in fact*
> unit-normal distributed on a random collection of words, and I don't
> actually see a reason to believe that this would get closer to being true if
> the f(w) were ranked first.
>
> If we can define precisely what we mean by "a random collection of words",
> the idea that the resulting normIP() values are or aren't unit-normal
> distributed seems easily testable, though.
>
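[Editor's note: if "a random collection of words" is taken to mean f(w) values
uniform on (0, 1), the test Tim describes is short: map uniform draws through an
inverse-normal transform and see whether the first two moments look unit-normal.
Using random.random() as the null model, and the bisection inverse, are
assumptions of this sketch, not the list's code.]

```python
import random
from math import erf, sqrt

def normIP(p):
    # Inverse unit-Gaussian CDF by bisection (illustrative accuracy only).
    lo, hi = -10.0, 10.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if (1.0 + erf(mid / sqrt(2.0))) / 2.0 < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

random.seed(42)
# Null hypothesis: the f(w) are uniform on (0, 1). Then the normIP() values
# should be unit-normal; checking the first two moments is a crude first test.
zs = [normIP(random.random()) for _ in range(10000)]
mean = sum(zs) / len(zs)
var = sum((z - mean) ** 2 for z in zs) / (len(zs) - 1)
print(round(mean, 2), round(var, 2))
```

Real f(w) values from a trained database would replace random.random() here;
a deviation of the moments (or of a histogram) from the unit normal would
show exactly how far the uniformity assumption fails.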