On Mon, Jul 27, 2009 at 11:59 AM, Robert Kern<robert.kern@gmail.com> wrote:
On Mon, Jul 27, 2009 at 02:33, Daniel J Farrell<daniel.farrell@imperial.ac.uk> wrote:
Dear list,
I am looking for some of the functionality provided by the GNU Scientific Library histogram module (http://www.gnu.org/software/gsl/manual/html_node/Histograms.html ).
In particular, a need to be able to create a probability density function from my histogram of data. This will allow inverse look-ups to be performed, i.e. for a random number (0-->1) find the associated probability (http://www.gnu.org/software/gsl/manual/html_node/The-histogram-probability-d... ). This allows the distribution and for samples to be returned weighted by the probability of the distribution -- which is a common task!
It looks like you want a CDF (or rather it's inverse, the PPF) rather than a PDF. Anyways, this is straightforward. Compute the histogram using normed=False. Find the cumulative sum and divide by the sum to get the (smoothed) empirical CDF. Prepend a 0.0 to this, and then this will align with the edges array that is also returned. Then you can use linear interpolation to do lookups. If you use the edges array as "X" and the empirical CDF as "Y", then this is a CDF. If you use the empircal CDF array as "X" and the edges as "Y", then this is a PPF.
In [12]: x = np.random.uniform(0, 10, size=1000)
In [13]: hist, edges = np.histogram(x)
In [15]: hist Out[15]: array([100, 101, 104, 108, 80, 111, 96, 88, 108, 104])
In [16]: edges Out[16]: array([ 3.53879571e-04, 1.00026423e+00, 2.00017458e+00, 3.00008493e+00, 3.99999528e+00, 4.99990563e+00, 5.99981598e+00, 6.99972633e+00, 7.99963668e+00, 8.99954704e+00, 9.99945739e+00])
In [18]: ecdf = np.hstack([0.0, hist.cumsum() / float(hist.sum())])
In [19]: ecdf Out[19]: array([ 0. , 0.1 , 0.201, 0.305, 0.413, 0.493, 0.604, 0.7 , 0.788, 0.896, 1. ])
In [20]: np.interp(np.linspace(0, 10), edges, ecdf) Out[20]: array([ 0. , 0.0203746 , 0.04078459, 0.06119459, 0.08160458, 0.10203472, 0.12264881, 0.14326291, 0.163877 , 0.18449109, 0.20522712, 0.22645351, 0.24767991, 0.2689063 , 0.29013269, 0.31160366, 0.33364646, 0.35568925, 0.37773204, 0.39977483, 0.41953158, 0.43585957, 0.45218756, 0.46851556, 0.48484355, 0.50433802, 0.52699311, 0.54964821, 0.5723033 , 0.59495839, 0.61577382, 0.63536742, 0.65496101, 0.6745546 , 0.6941482 , 0.71259664, 0.73055743, 0.74851823, 0.76647902, 0.78443982, 0.80567348, 0.82771627, 0.84975906, 0.87180185, 0.89384465, 0.91515087, 0.93637726, 0.95760365, 0.97883004, 1. ])
In [21]: np.interp(np.linspace(0, 1), ecdf, edges) Out[21]: array([ 3.53879571e-04, 2.04417216e-01, 4.08480553e-01, 6.12543890e-01, 8.16607227e-01, 1.02046852e+00, 1.22251143e+00, 1.42455434e+00, 1.62659724e+00, 1.82864015e+00, 2.02980301e+00, 2.22601775e+00, 2.42223250e+00, 2.61844725e+00, 2.81466200e+00, 3.01047705e+00, 3.19942458e+00, 3.38837211e+00, 3.57731965e+00, 3.76626718e+00, 3.95521472e+00, 4.19462069e+00, 4.44969986e+00, 4.70477903e+00, 4.95985820e+00, 5.15488346e+00, 5.33872431e+00, 5.52256515e+00, 5.70640600e+00, 5.89024684e+00, 6.08569264e+00, 6.29825861e+00, 6.51082459e+00, 6.72339057e+00, 6.93595654e+00, 7.16204944e+00, 7.39393960e+00, 7.62582975e+00, 7.85771991e+00, 8.07294833e+00, 8.26189586e+00, 8.45084340e+00, 8.63979093e+00, 8.82873846e+00, 9.01838365e+00, 9.21459840e+00, 9.41081315e+00, 9.60702789e+00, 9.80324264e+00, 9.99945739e+00])
This, together with the answer last week to a piecewise constant pdf, would make most of the elements for a distribution class, similar to the generic discrete distribution only for the continuous case. It could take as constructor either a data array (and use histogram as in this example, with theoretically optimal bin size) or a set of points and pdf values as in last weeks example. Does quad work correctly for integration of a piecewise function? If yes, then we could just use the generic methods by subclassing rv_continuous. Is it worth collecting these results? Josef
-- Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco _______________________________________________ SciPy-User mailing list SciPy-User@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-user