[SciPy-User] Scipy and statistics: probability density function
Robert Kern
robert.kern at gmail.com
Mon Aug 3 20:56:59 EDT 2009
On Mon, Jul 27, 2009 at 15:45, <josef.pktd at gmail.com> wrote:
> On Mon, Jul 27, 2009 at 11:59 AM, Robert Kern<robert.kern at gmail.com> wrote:
>> On Mon, Jul 27, 2009 at 02:33, Daniel J
>> Farrell<daniel.farrell at imperial.ac.uk> wrote:
>>> Dear list,
>>>
>>> I am looking for some of the functionality provided by the GNU
>>> Scientific Library histogram module (http://www.gnu.org/software/gsl/manual/html_node/Histograms.html
>>> ).
>>>
>>> In particular, a need to be able to create a probability density
>>> function from my histogram of data. This will allow inverse look-ups
>>> to be performed, i.e. for a random number (0-->1) find the associated
>>> probability (http://www.gnu.org/software/gsl/manual/html_node/The-histogram-probability-distribution-struct.html
>>> ). This allows the distribution and for samples to be returned
>>> weighted by the probability of the distribution -- which is a common
>>> task!
>>
>> It looks like you want a CDF (or rather it's inverse, the PPF) rather
>> than a PDF. Anyways, this is straightforward. Compute the histogram
>> using normed=False. Find the cumulative sum and divide by the sum to
>> get the (smoothed) empirical CDF. Prepend a 0.0 to this, and then this
>> will align with the edges array that is also returned. Then you can
>> use linear interpolation to do lookups. If you use the edges array as
>> "X" and the empirical CDF as "Y", then this is a CDF. If you use the
>> empircal CDF array as "X" and the edges as "Y", then this is a PPF.
>>
>>
>> In [12]: x = np.random.uniform(0, 10, size=1000)
>>
>> In [13]: hist, edges = np.histogram(x)
>>
>> In [15]: hist
>> Out[15]: array([100, 101, 104, 108, 80, 111, 96, 88, 108, 104])
>>
>> In [16]: edges
>> Out[16]:
>> array([ 3.53879571e-04, 1.00026423e+00, 2.00017458e+00,
>> 3.00008493e+00, 3.99999528e+00, 4.99990563e+00,
>> 5.99981598e+00, 6.99972633e+00, 7.99963668e+00,
>> 8.99954704e+00, 9.99945739e+00])
>>
>> In [18]: ecdf = np.hstack([0.0, hist.cumsum() / float(hist.sum())])
>>
>> In [19]: ecdf
>> Out[19]:
>> array([ 0. , 0.1 , 0.201, 0.305, 0.413, 0.493, 0.604, 0.7 ,
>> 0.788, 0.896, 1. ])
>>
>> In [20]: np.interp(np.linspace(0, 10), edges, ecdf)
>> Out[20]:
>> array([ 0. , 0.0203746 , 0.04078459, 0.06119459, 0.08160458,
>> 0.10203472, 0.12264881, 0.14326291, 0.163877 , 0.18449109,
>> 0.20522712, 0.22645351, 0.24767991, 0.2689063 , 0.29013269,
>> 0.31160366, 0.33364646, 0.35568925, 0.37773204, 0.39977483,
>> 0.41953158, 0.43585957, 0.45218756, 0.46851556, 0.48484355,
>> 0.50433802, 0.52699311, 0.54964821, 0.5723033 , 0.59495839,
>> 0.61577382, 0.63536742, 0.65496101, 0.6745546 , 0.6941482 ,
>> 0.71259664, 0.73055743, 0.74851823, 0.76647902, 0.78443982,
>> 0.80567348, 0.82771627, 0.84975906, 0.87180185, 0.89384465,
>> 0.91515087, 0.93637726, 0.95760365, 0.97883004, 1. ])
>>
>> In [21]: np.interp(np.linspace(0, 1), ecdf, edges)
>> Out[21]:
>> array([ 3.53879571e-04, 2.04417216e-01, 4.08480553e-01,
>> 6.12543890e-01, 8.16607227e-01, 1.02046852e+00,
>> 1.22251143e+00, 1.42455434e+00, 1.62659724e+00,
>> 1.82864015e+00, 2.02980301e+00, 2.22601775e+00,
>> 2.42223250e+00, 2.61844725e+00, 2.81466200e+00,
>> 3.01047705e+00, 3.19942458e+00, 3.38837211e+00,
>> 3.57731965e+00, 3.76626718e+00, 3.95521472e+00,
>> 4.19462069e+00, 4.44969986e+00, 4.70477903e+00,
>> 4.95985820e+00, 5.15488346e+00, 5.33872431e+00,
>> 5.52256515e+00, 5.70640600e+00, 5.89024684e+00,
>> 6.08569264e+00, 6.29825861e+00, 6.51082459e+00,
>> 6.72339057e+00, 6.93595654e+00, 7.16204944e+00,
>> 7.39393960e+00, 7.62582975e+00, 7.85771991e+00,
>> 8.07294833e+00, 8.26189586e+00, 8.45084340e+00,
>> 8.63979093e+00, 8.82873846e+00, 9.01838365e+00,
>> 9.21459840e+00, 9.41081315e+00, 9.60702789e+00,
>> 9.80324264e+00, 9.99945739e+00])
>>
>
>
> This, together with the answer last week to a piecewise constant pdf,
> would make most of the elements for a distribution class, similar to
> the generic discrete distribution only for the continuous case.
>
> It could take as constructor either a data array (and use histogram as
> in this example, with theoretically optimal bin size) or a set of
> points and pdf values as in last weeks example.
>
> Does quad work correctly for integration of a piecewise function?
> If yes, then we could just use the generic methods by subclassing rv_continuous.
I suspect that it would be worthwhile to implement at least the CDF.
> Is it worth collecting these results?
Sure!
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
More information about the SciPy-User
mailing list