[Spambayes] 5% points in statistics
Rob W. W. Hooft
rob@hooft.net
Thu Oct 17 14:18:31 2002
---------------------- multipart/mixed attachment
I added 5% and 95% points to the statistics in Histogram.py. The
calculation is similar to that for the median, which is simply the 50%
point. The effect on the output is:
-> <stat> Ham scores for all runs: 16000 items; mean 0.59; sdev 4.96
-> <stat> min 0; median 1.36141e-11; max 100
-> <stat> fivepctlo 0; fivepcthi 0.144228
-> <stat> Spam scores for all runs: 5800 items; mean 99.02; sdev 5.86
-> <stat> min 6.85475e-09; median 100; max 100
-> <stat> fivepctlo 96.8278; fivepcthi 100
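So this does reveal new information about the distributions: although
the "sdev" values for ham and spam are very similar (4.96 vs. 5.86), the
fivepct{lo,hi} values show that the distributions are NOT the same
width. 95% of ham scores lie within 0.14 of 0, whereas 95% of spam
scores lie within 3.17 of 100, so the ham distribution is roughly 20
times tighter than the spam distribution (3.17 / 0.14 is about 22).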
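
To make the calculation concrete, here is a minimal standalone sketch of
the same linear interpolation. It is an illustration only, not the code
in Histogram.py: the function name percentile_point and the sample
scores are made up, it is written for Python 3, and it clamps the upper
index so that a one-item list does not run off the end.

```python
def percentile_point(sorted_data, fraction):
    """Linearly interpolated value at `fraction` (0.0 to 1.0) of sorted data."""
    n = len(sorted_data)
    if n == 0:
        raise ValueError("need at least one data point")
    position = fraction * (n - 1)    # fractional index into the sorted list
    lower = int(position)            # index at or just below that position
    upper = min(lower + 1, n - 1)    # clamp so the last element stays in range
    frac = position - lower          # weight given to the upper neighbour
    return sorted_data[lower] * (1 - frac) + sorted_data[upper] * frac


# Hypothetical sample data, already sorted as the interpolation requires.
scores = sorted([0.0, 10.0, 20.0, 30.0, 40.0])
print("fivepctlo", percentile_point(scores, 0.05))
print("median   ", percentile_point(scores, 0.50))
print("fivepcthi", percentile_point(scores, 0.95))
```

With these five scores the three values come out at about 2, 20 and 38
(up to floating-point rounding), and a fraction of 0.50 gives the
ordinary median.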
Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/
---------------------- multipart/mixed attachment
Index: Histogram.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Histogram.py,v
retrieving revision 1.5
diff -u -r1.5 Histogram.py
--- Histogram.py 8 Oct 2002 18:13:49 -0000 1.5
+++ Histogram.py 17 Oct 2002 13:13:41 -0000
@@ -28,6 +28,8 @@
# min smallest value in collection
# max largest value in collection
# median midpoint
+ # fivepctlo five percent of data is lower than this
+ # fivepcthi five percent of data is higher than this
# mean
# var variance
# sdev population standard deviation (sqrt(variance))
@@ -47,6 +49,14 @@
self.median = data[n // 2]
else:
self.median = (data[n // 2] + data[(n-1) // 2]) / 2.0
+ xfivepct = 0.05 * (n-1)
+ frac = xfivepct % 1.0
+ self.fivepctlo = (data[int(xfivepct)] * (1 - frac) +
+ data[min(int(xfivepct) + 1, n - 1)] * frac)
+ xfivepct = 0.95 * (n-1)
+ frac = xfivepct % 1.0
+ self.fivepcthi = (data[int(xfivepct)] * (1 - frac) +
+ data[min(int(xfivepct) + 1, n - 1)] * frac)
# Compute mean.
# Add in increasing order of magnitude, to minimize roundoff error.
if data[0] < 0.0:
@@ -124,6 +134,8 @@
print "-> <stat> min %g; median %g; max %g" % (self.min,
self.median,
self.max)
+ print "-> <stat> fivepctlo %g; fivepcthi %g" % (self.fivepctlo,
+ self.fivepcthi)
lo, hi = self.get_lo_hi()
if lo > hi:
return
---------------------- multipart/mixed attachment--