So I tried to use stats.histogram and I believe I found a bug, but I also have a bigger wtf with the algorithm. If I create a random vector in [0,1), I get uneven counts. Here it is:

####################################################
import scipy
from scipy.stats import histogram
from random import random

n = 100
randArray = [random() for i in range(n)]
answer = histogram(randArray)
print answer
####################################################

gives

(array([15, 17, 22, 20, 20, 6, 0, 0, 0, 0]), -0.094840236656285076, 0.20868271373704711, 0)

Notice that the last entries in the histogram are all zeros. This is always the case. Also notice that the bin width is about .2, which is approximately double what it should be.

I have traced the error. In the stats.py module, the code

    estbinwidth = float(Max - Min)/float(numbins) + 1
    binsize = (Max-Min+estbinwidth)/float(numbins)

computes the bin size incorrectly. In particular, the +1 in estbinwidth needs to be in parentheses. You would only notice this for small data ranges, which is perhaps why it was never noticed. Personally, I would just compute the binsize in one step as

    binsize = (numbins+1)*(Max-Min+estbinwidth)/numbins**2

But now for the wtf part. The histogram centers the lowest and highest bins on the lowest and highest points in the data, as witnessed by the code

    lowerreallimit = Min - binsize/2.0

and by the fact that the bin size isn't just (max - min)/n. Why would you possibly want to do this? With this technique, if you histogram a sample of uniform random variates, the outer two bins will have half the counts of the other bins, because only half of each outer bin is within the range of the data. There must be a reason for doing this, but I sure don't know what it is.

D
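
The bin arithmetic described in the message can be sketched in plain Python, without SciPy. This is only an illustration under stated assumptions, not the stats.py implementation: legacy_binsize copies the two quoted lines, while centered_binsize and bin_counts are hypothetical helpers that assume equal-width bins starting at Min - binsize/2.0, with the first and last bins centered exactly on the smallest and largest data points.

```python
import random


def legacy_binsize(lo, hi, numbins):
    # The two lines quoted from stats.py, unchanged: the "+ 1" is added
    # to the quotient rather than to numbins.
    estbinwidth = float(hi - lo) / float(numbins) + 1
    return (hi - lo + estbinwidth) / float(numbins)


def centered_binsize(lo, hi, numbins):
    # Assumption (not from stats.py): the width that puts the centers of
    # the first and last bins exactly on lo and hi.
    return float(hi - lo) / (numbins - 1)


def bin_counts(data, numbins, binsize):
    # Assumption: equal-width bins starting at min(data) - binsize/2,
    # mirroring the quoted lowerreallimit line. Points that fall outside
    # the numbins bins are ignored.
    lower = min(data) - binsize / 2.0
    counts = [0] * numbins
    for x in data:
        i = int((x - lower) // binsize)
        if 0 <= i < numbins:
            counts[i] += 1
    return counts


random.seed(0)  # fixed seed so the run is repeatable
data = [random.random() for _ in range(100000)]
lo, hi = min(data), max(data)

legacy = bin_counts(data, 10, legacy_binsize(lo, hi, 10))
centered = bin_counts(data, 10, centered_binsize(lo, hi, 10))
print(legacy)    # counts using the bin width from the quoted lines
print(centered)  # counts using end bins centered on min and max
```

With the quoted bin width, the ten bins span roughly twice the data range, so the upper bins stay empty. With the centered width, every point lands in a bin, but the two end bins collect about half as many points as the interior bins, because half of each end bin lies outside the range of the data.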
participants (1)
-
danny shevitz