[SciPy-User] Scipy and statistics: probability density function
josef.pktd at gmail.com
josef.pktd at gmail.com
Mon Jul 27 16:45:37 EDT 2009
On Mon, Jul 27, 2009 at 11:59 AM, Robert Kern<robert.kern at gmail.com> wrote:
> On Mon, Jul 27, 2009 at 02:33, Daniel J
> Farrell<daniel.farrell at imperial.ac.uk> wrote:
>> Dear list,
>>
>> I am looking for some of the functionality provided by the GNU
>> Scientific Library histogram module (http://www.gnu.org/software/gsl/manual/html_node/Histograms.html
>> ).
>>
>> In particular, a need to be able to create a probability density
>> function from my histogram of data. This will allow inverse look-ups
>> to be performed, i.e. for a random number (0-->1) find the associated
>> probability (http://www.gnu.org/software/gsl/manual/html_node/The-histogram-probability-distribution-struct.html
>> ). This allows the distribution and for samples to be returned
>> weighted by the probability of the distribution -- which is a common
>> task!
>
> It looks like you want a CDF (or rather it's inverse, the PPF) rather
> than a PDF. Anyways, this is straightforward. Compute the histogram
> using normed=False. Find the cumulative sum and divide by the sum to
> get the (smoothed) empirical CDF. Prepend a 0.0 to this, and then this
> will align with the edges array that is also returned. Then you can
> use linear interpolation to do lookups. If you use the edges array as
> "X" and the empirical CDF as "Y", then this is a CDF. If you use the
> empircal CDF array as "X" and the edges as "Y", then this is a PPF.
>
>
> In [12]: x = np.random.uniform(0, 10, size=1000)
>
> In [13]: hist, edges = np.histogram(x)
>
> In [15]: hist
> Out[15]: array([100, 101, 104, 108, 80, 111, 96, 88, 108, 104])
>
> In [16]: edges
> Out[16]:
> array([ 3.53879571e-04, 1.00026423e+00, 2.00017458e+00,
> 3.00008493e+00, 3.99999528e+00, 4.99990563e+00,
> 5.99981598e+00, 6.99972633e+00, 7.99963668e+00,
> 8.99954704e+00, 9.99945739e+00])
>
> In [18]: ecdf = np.hstack([0.0, hist.cumsum() / float(hist.sum())])
>
> In [19]: ecdf
> Out[19]:
> array([ 0. , 0.1 , 0.201, 0.305, 0.413, 0.493, 0.604, 0.7 ,
> 0.788, 0.896, 1. ])
>
> In [20]: np.interp(np.linspace(0, 10), edges, ecdf)
> Out[20]:
> array([ 0. , 0.0203746 , 0.04078459, 0.06119459, 0.08160458,
> 0.10203472, 0.12264881, 0.14326291, 0.163877 , 0.18449109,
> 0.20522712, 0.22645351, 0.24767991, 0.2689063 , 0.29013269,
> 0.31160366, 0.33364646, 0.35568925, 0.37773204, 0.39977483,
> 0.41953158, 0.43585957, 0.45218756, 0.46851556, 0.48484355,
> 0.50433802, 0.52699311, 0.54964821, 0.5723033 , 0.59495839,
> 0.61577382, 0.63536742, 0.65496101, 0.6745546 , 0.6941482 ,
> 0.71259664, 0.73055743, 0.74851823, 0.76647902, 0.78443982,
> 0.80567348, 0.82771627, 0.84975906, 0.87180185, 0.89384465,
> 0.91515087, 0.93637726, 0.95760365, 0.97883004, 1. ])
>
> In [21]: np.interp(np.linspace(0, 1), ecdf, edges)
> Out[21]:
> array([ 3.53879571e-04, 2.04417216e-01, 4.08480553e-01,
> 6.12543890e-01, 8.16607227e-01, 1.02046852e+00,
> 1.22251143e+00, 1.42455434e+00, 1.62659724e+00,
> 1.82864015e+00, 2.02980301e+00, 2.22601775e+00,
> 2.42223250e+00, 2.61844725e+00, 2.81466200e+00,
> 3.01047705e+00, 3.19942458e+00, 3.38837211e+00,
> 3.57731965e+00, 3.76626718e+00, 3.95521472e+00,
> 4.19462069e+00, 4.44969986e+00, 4.70477903e+00,
> 4.95985820e+00, 5.15488346e+00, 5.33872431e+00,
> 5.52256515e+00, 5.70640600e+00, 5.89024684e+00,
> 6.08569264e+00, 6.29825861e+00, 6.51082459e+00,
> 6.72339057e+00, 6.93595654e+00, 7.16204944e+00,
> 7.39393960e+00, 7.62582975e+00, 7.85771991e+00,
> 8.07294833e+00, 8.26189586e+00, 8.45084340e+00,
> 8.63979093e+00, 8.82873846e+00, 9.01838365e+00,
> 9.21459840e+00, 9.41081315e+00, 9.60702789e+00,
> 9.80324264e+00, 9.99945739e+00])
>
This, together with the answer last week to a piecewise constant pdf,
would make most of the elements for a distribution class, similar to
the generic discrete distribution only for the continuous case.
It could take as constructor either a data array (and use histogram as
in this example, with theoretically optimal bin size) or a set of
points and pdf values as in last weeks example.
Does quad work correctly for integration of a piecewise function?
If yes, then we could just use the generic methods by subclassing rv_continuous.
Is it worth collecting these results?
Josef
> --
> Robert Kern
>
> "I have come to believe that the whole world is an enigma, a harmless
> enigma that is made terrible by our own mad attempt to interpret it as
> though it had an underlying truth."
> -- Umberto Eco
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
More information about the SciPy-User
mailing list