[SciPy-User] Scipy and statistics: probability density function

Mon Aug 3 20:56:59 EDT 2009

On Mon, Jul 27, 2009 at 15:45, <josef.pktd at gmail.com> wrote:
> On Mon, Jul 27, 2009 at 11:59 AM, Robert Kern<robert.kern at gmail.com> wrote:
>> On Mon, Jul 27, 2009 at 02:33, Daniel J
>> Farrell<daniel.farrell at imperial.ac.uk> wrote:
>>> Dear list,
>>>
>>> I am looking for some of the functionality provided by the GNU
>>> Scientific Library histogram module (http://www.gnu.org/software/gsl/manual/html_node/Histograms.html
>>> ).
>>>
>>> In particular, a need to be able to create a probability density
>>> function from my histogram of data. This will allow inverse look-ups
>>> to be performed, i.e. for a random number (0-->1) find the associated
>>> probability (http://www.gnu.org/software/gsl/manual/html_node/The-histogram-probability-distribution-struct.html
>>> ). This allows the distribution and for samples to be returned
>>> weighted by the probability of the distribution -- which is a common
>>> task!
>>
>> It looks like you want a CDF (or rather it's inverse, the PPF) rather
>> than a PDF. Anyways, this is straightforward. Compute the histogram
>> using normed=False. Find the cumulative sum and divide by the sum to
>> get the (smoothed) empirical CDF. Prepend a 0.0 to this, and then this
>> will align with the edges array that is also returned. Then you can
>> use linear interpolation to do lookups. If you use the edges array as
>> "X" and the empirical CDF as "Y", then this is a CDF. If you use the
>> empircal CDF array as "X" and the edges as "Y", then this is a PPF.
>>
>>
>> In [12]: x = np.random.uniform(0, 10, size=1000)
>>
>> In [13]: hist, edges = np.histogram(x)
>>
>> In [15]: hist
>> Out[15]: array([100, 101, 104, 108,  80, 111,  96,  88, 108, 104])
>>
>> In [16]: edges
>> Out[16]:
>> array([  3.53879571e-04,   1.00026423e+00,   2.00017458e+00,
>>         3.00008493e+00,   3.99999528e+00,   4.99990563e+00,
>>         5.99981598e+00,   6.99972633e+00,   7.99963668e+00,
>>         8.99954704e+00,   9.99945739e+00])
>>
>> In [18]: ecdf = np.hstack([0.0, hist.cumsum() / float(hist.sum())])
>>
>> In [19]: ecdf
>> Out[19]:
>> array([ 0.   ,  0.1  ,  0.201,  0.305,  0.413,  0.493,  0.604,  0.7  ,
>>        0.788,  0.896,  1.   ])
>>
>> In [20]: np.interp(np.linspace(0, 10), edges, ecdf)
>> Out[20]:
>> array([ 0.        ,  0.0203746 ,  0.04078459,  0.06119459,  0.08160458,
>>        0.10203472,  0.12264881,  0.14326291,  0.163877  ,  0.18449109,
>>        0.20522712,  0.22645351,  0.24767991,  0.2689063 ,  0.29013269,
>>        0.31160366,  0.33364646,  0.35568925,  0.37773204,  0.39977483,
>>        0.41953158,  0.43585957,  0.45218756,  0.46851556,  0.48484355,
>>        0.50433802,  0.52699311,  0.54964821,  0.5723033 ,  0.59495839,
>>        0.61577382,  0.63536742,  0.65496101,  0.6745546 ,  0.6941482 ,
>>        0.71259664,  0.73055743,  0.74851823,  0.76647902,  0.78443982,
>>        0.80567348,  0.82771627,  0.84975906,  0.87180185,  0.89384465,
>>        0.91515087,  0.93637726,  0.95760365,  0.97883004,  1.        ])
>>
>> In [21]: np.interp(np.linspace(0, 1), ecdf, edges)
>> Out[21]:
>> array([  3.53879571e-04,   2.04417216e-01,   4.08480553e-01,
>>         6.12543890e-01,   8.16607227e-01,   1.02046852e+00,
>>         1.22251143e+00,   1.42455434e+00,   1.62659724e+00,
>>         1.82864015e+00,   2.02980301e+00,   2.22601775e+00,
>>         2.42223250e+00,   2.61844725e+00,   2.81466200e+00,
>>         3.01047705e+00,   3.19942458e+00,   3.38837211e+00,
>>         3.57731965e+00,   3.76626718e+00,   3.95521472e+00,
>>         4.19462069e+00,   4.44969986e+00,   4.70477903e+00,
>>         4.95985820e+00,   5.15488346e+00,   5.33872431e+00,
>>         5.52256515e+00,   5.70640600e+00,   5.89024684e+00,
>>         6.08569264e+00,   6.29825861e+00,   6.51082459e+00,
>>         6.72339057e+00,   6.93595654e+00,   7.16204944e+00,
>>         7.39393960e+00,   7.62582975e+00,   7.85771991e+00,
>>         8.07294833e+00,   8.26189586e+00,   8.45084340e+00,
>>         8.63979093e+00,   8.82873846e+00,   9.01838365e+00,
>>         9.21459840e+00,   9.41081315e+00,   9.60702789e+00,
>>         9.80324264e+00,   9.99945739e+00])
>>
>
>
> This, together with the answer last week to a piecewise constant pdf,
> would make most of the elements for a distribution class, similar to
> the generic discrete distribution only for the continuous case.
>
> It could take as constructor either a data array (and use histogram as
> in this example, with theoretically optimal bin size) or a set of
> points and pdf values as in last weeks example.
>
> Does quad work correctly for integration of a piecewise function?
> If yes, then we could just use the generic methods by subclassing rv_continuous.

I suspect that it would be worthwhile to implement at least the CDF.

> Is it worth collecting these results?

Sure!

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco