[SciPy-User] Scipy and statistics: probability density function

Mon Jul 27 16:45:37 EDT 2009

On Mon, Jul 27, 2009 at 11:59 AM, Robert Kern<robert.kern at gmail.com> wrote:
> On Mon, Jul 27, 2009 at 02:33, Daniel J
> Farrell<daniel.farrell at imperial.ac.uk> wrote:
>> Dear list,
>>
>> I am looking for some of the functionality provided by the GNU
>> Scientific Library histogram module (http://www.gnu.org/software/gsl/manual/html_node/Histograms.html
>> ).
>>
>> In particular, a need to be able to create a probability density
>> function from my histogram of data. This will allow inverse look-ups
>> to be performed, i.e. for a random number (0-->1) find the associated
>> probability (http://www.gnu.org/software/gsl/manual/html_node/The-histogram-probability-distribution-struct.html
>> ). This allows the distribution and for samples to be returned
>> weighted by the probability of the distribution -- which is a common
>> task!
>
> It looks like you want a CDF (or rather it's inverse, the PPF) rather
> than a PDF. Anyways, this is straightforward. Compute the histogram
> using normed=False. Find the cumulative sum and divide by the sum to
> get the (smoothed) empirical CDF. Prepend a 0.0 to this, and then this
> will align with the edges array that is also returned. Then you can
> use linear interpolation to do lookups. If you use the edges array as
> "X" and the empirical CDF as "Y", then this is a CDF. If you use the
> empircal CDF array as "X" and the edges as "Y", then this is a PPF.
>
>
> In [12]: x = np.random.uniform(0, 10, size=1000)
>
> In [13]: hist, edges = np.histogram(x)
>
> In [15]: hist
> Out[15]: array([100, 101, 104, 108,  80, 111,  96,  88, 108, 104])
>
> In [16]: edges
> Out[16]:
> array([  3.53879571e-04,   1.00026423e+00,   2.00017458e+00,
>         3.00008493e+00,   3.99999528e+00,   4.99990563e+00,
>         5.99981598e+00,   6.99972633e+00,   7.99963668e+00,
>         8.99954704e+00,   9.99945739e+00])
>
> In [18]: ecdf = np.hstack([0.0, hist.cumsum() / float(hist.sum())])
>
> In [19]: ecdf
> Out[19]:
> array([ 0.   ,  0.1  ,  0.201,  0.305,  0.413,  0.493,  0.604,  0.7  ,
>        0.788,  0.896,  1.   ])
>
> In [20]: np.interp(np.linspace(0, 10), edges, ecdf)
> Out[20]:
> array([ 0.        ,  0.0203746 ,  0.04078459,  0.06119459,  0.08160458,
>        0.10203472,  0.12264881,  0.14326291,  0.163877  ,  0.18449109,
>        0.20522712,  0.22645351,  0.24767991,  0.2689063 ,  0.29013269,
>        0.31160366,  0.33364646,  0.35568925,  0.37773204,  0.39977483,
>        0.41953158,  0.43585957,  0.45218756,  0.46851556,  0.48484355,
>        0.50433802,  0.52699311,  0.54964821,  0.5723033 ,  0.59495839,
>        0.61577382,  0.63536742,  0.65496101,  0.6745546 ,  0.6941482 ,
>        0.71259664,  0.73055743,  0.74851823,  0.76647902,  0.78443982,
>        0.80567348,  0.82771627,  0.84975906,  0.87180185,  0.89384465,
>        0.91515087,  0.93637726,  0.95760365,  0.97883004,  1.        ])
>
> In [21]: np.interp(np.linspace(0, 1), ecdf, edges)
> Out[21]:
> array([  3.53879571e-04,   2.04417216e-01,   4.08480553e-01,
>         6.12543890e-01,   8.16607227e-01,   1.02046852e+00,
>         1.22251143e+00,   1.42455434e+00,   1.62659724e+00,
>         1.82864015e+00,   2.02980301e+00,   2.22601775e+00,
>         2.42223250e+00,   2.61844725e+00,   2.81466200e+00,
>         3.01047705e+00,   3.19942458e+00,   3.38837211e+00,
>         3.57731965e+00,   3.76626718e+00,   3.95521472e+00,
>         4.19462069e+00,   4.44969986e+00,   4.70477903e+00,
>         4.95985820e+00,   5.15488346e+00,   5.33872431e+00,
>         5.52256515e+00,   5.70640600e+00,   5.89024684e+00,
>         6.08569264e+00,   6.29825861e+00,   6.51082459e+00,
>         6.72339057e+00,   6.93595654e+00,   7.16204944e+00,
>         7.39393960e+00,   7.62582975e+00,   7.85771991e+00,
>         8.07294833e+00,   8.26189586e+00,   8.45084340e+00,
>         8.63979093e+00,   8.82873846e+00,   9.01838365e+00,
>         9.21459840e+00,   9.41081315e+00,   9.60702789e+00,
>         9.80324264e+00,   9.99945739e+00])
>

This, together with the answer last week to a piecewise constant pdf,
would make most of the elements for a distribution class, similar to
the generic discrete distribution only for the continuous case.

It could take as constructor either a data array (and use histogram as
in this example, with theoretically optimal bin size) or a set of
points and pdf values as in last weeks example.

Does quad work correctly for integration of a piecewise function?
If yes, then we could just use the generic methods by subclassing rv_continuous.

Is it worth collecting these results?

Josef

> --
> Robert Kern
>
> "I have come to believe that the whole world is an enigma, a harmless
> enigma that is made terrible by our own mad attempt to interpret it as
> though it had an underlying truth."
>  -- Umberto Eco
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>