[Spambayes] 5% points in statistics

Rob W. W. Hooft rob@hooft.net
Thu Oct 17 14:18:31 2002


---------------------- multipart/mixed attachment
I added 5% and 95% points to the statistics in Histogram.py. They are 
calculated much like the median, which is simply the 50% point. The 
output now looks like this:

-> <stat> Ham scores for all runs: 16000 items; mean 0.59; sdev 4.96
-> <stat> min 0; median 1.36141e-11; max 100
-> <stat> fivepctlo 0; fivepcthi 0.144228
-> <stat> Spam scores for all runs: 5800 items; mean 99.02; sdev 5.86
-> <stat> min 6.85475e-09; median 100; max 100
-> <stat> fivepctlo 96.8278; fivepcthi 100

So this does reveal new information about the distributions: although 
the "sdev" values for ham and spam are very similar, the fivepct{lo,hi} 
values show that the distributions are NOT the same width. 95% of ham 
scores below 0.14, while 95% of spam scores above 96.8, so the bulk of 
the ham is about 20 times tighter than the bulk of the spam.
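
As a quick check of that factor, the arithmetic can be redone from the 
figures quoted above. This is only a sketch: it assumes "95% of ham" 
means the range from min to fivepcthi and "95% of spam" means the range 
from fivepctlo to max, and the variable names are made up for the 
example.

```python
# Figures copied from the <stat> lines quoted above.
ham_min, ham_fivepcthi = 0.0, 0.144228
spam_fivepctlo, spam_max = 96.8278, 100.0

# Width of the interval holding the 95% of scores nearest each extreme
# (an assumed reading of "95% of ham" and "95% of spam").
ham_width = ham_fivepcthi - ham_min      # lowest-scoring 95% of ham
spam_width = spam_max - spam_fivepctlo   # highest-scoring 95% of spam

ratio = spam_width / ham_width
print("ham width %g, spam width %g, ratio %.1f"
      % (ham_width, spam_width, ratio))
```

Under that reading the ratio comes out near 22, which fits the rough 
"20 times" figure.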

Rob
-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
Index: Histogram.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Histogram.py,v
retrieving revision 1.5
diff -u -r1.5 Histogram.py
--- Histogram.py	8 Oct 2002 18:13:49 -0000	1.5
+++ Histogram.py	17 Oct 2002 13:13:41 -0000
@@ -28,6 +28,8 @@
     #     min       smallest value in collection
     #     max       largest value in collection
     #     median    midpoint
+    #     fivepctlo five percent of data is lower than this
+    #     fivepcthi five percent of data is higher than this
     #     mean
     #     var       variance
     #     sdev      population standard deviation (sqrt(variance))
@@ -47,6 +49,14 @@
             self.median = data[n // 2]
         else:
             self.median = (data[n // 2] + data[(n-1) // 2]) / 2.0
+        xfivepct = 0.05 * (n-1)
+        frac = xfivepct % 1.0
+        self.fivepctlo = (data[int(xfivepct)] * (1 - frac) +
+                          data[min(int(xfivepct) + 1, n-1)] * frac)
+        xfivepct = 0.95 * (n-1)
+        frac = xfivepct % 1.0
+        self.fivepcthi = (data[int(xfivepct)] * (1 - frac) +
+                          data[min(int(xfivepct) + 1, n-1)] * frac)
         # Compute mean.
         # Add in increasing order of magnitude, to minimize roundoff error.
         if data[0] < 0.0:
@@ -124,6 +134,8 @@
         print "-> <stat> min %g; median %g; max %g" % (self.min,
                                                        self.median,
                                                        self.max)
+        print "-> <stat> fivepctlo %g; fivepcthi %g" % (self.fivepctlo,
+                                                        self.fivepcthi)
         lo, hi = self.get_lo_hi()
         if lo > hi:
             return

---------------------- multipart/mixed attachment--
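
The interpolation used in the patch can also be tried on its own. The 
following is a minimal sketch in present-day Python, not code from 
Histogram.py: the function name interpolated_point and the sample 
scores are hypothetical. The clamp on the upper index keeps a one-item 
list from indexing past the end.

```python
def interpolated_point(data, fraction):
    """Return the value below which roughly `fraction` of `data` lies.

    Hypothetical helper, not part of Histogram.py. `data` must be a
    non-empty sequence of numbers; `fraction` must be in [0, 1].
    """
    if not data:
        raise ValueError("data must not be empty")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(data)
    n = len(ordered)
    position = fraction * (n - 1)    # fractional index into the sorted data
    lower = int(position)            # index at or just below the position
    upper = min(lower + 1, n - 1)    # clamp so a one-item list still works
    frac = position - lower          # distance past the lower index
    # Linear interpolation between the two neighbouring items.
    return ordered[lower] * (1 - frac) + ordered[upper] * frac


scores = [0, 10, 20, 30, 40]              # made-up sample data
print(interpolated_point(scores, 0.05))   # counterpart of fivepctlo
print(interpolated_point(scores, 0.50))   # same point as the median
print(interpolated_point(scores, 0.95))   # counterpart of fivepcthi
```

With fraction 0.5 this gives the same result as the median code in the 
patch: the middle item for odd n, and the average of the two middle 
items for even n.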