Re: [SciPy-User] Scipy and statistics: probability density function

27 Jul 2009

      On Mon, Jul 27, 2009 at 11:59 AM, Robert Kern<robert.kern@gmail.com> wrote:
...
On Mon, Jul 27, 2009 at 02:33, Daniel J
Farrell<daniel.farrell@imperial.ac.uk> wrote:
...
Dear list,
I am looking for some of the functionality provided by the GNU
Scientific Library histogram module (http://www.gnu.org/software/gsl/manual/html_node/Histograms.html
).
In particular, a need to be able to create a probability density
function from my histogram of data. This will allow inverse look-ups
to be performed, i.e. for a random number (0-->1) find the associated
probability (http://www.gnu.org/software/gsl/manual/html_node/The-histogram-probability-d...
). This allows the distribution and for samples to be returned
weighted by the probability of the distribution -- which is a common
task!
It looks like you want a CDF (or rather it's inverse, the PPF) rather
than a PDF. Anyways, this is straightforward. Compute the histogram
using normed=False. Find the cumulative sum and divide by the sum to
get the (smoothed) empirical CDF. Prepend a 0.0 to this, and then this
will align with the edges array that is also returned. Then you can
use linear interpolation to do lookups. If you use the edges array as
"X" and the empirical CDF as "Y", then this is a CDF. If you use the
empircal CDF array as "X" and the edges as "Y", then this is a PPF.
In [12]: x = np.random.uniform(0, 10, size=1000)
In [13]: hist, edges = np.histogram(x)
In [15]: hist
Out[15]: array([100, 101, 104, 108,  80, 111,  96,  88, 108, 104])
In [16]: edges
Out[16]:
array([  3.53879571e-04,   1.00026423e+00,   2.00017458e+00,
        3.00008493e+00,   3.99999528e+00,   4.99990563e+00,
        5.99981598e+00,   6.99972633e+00,   7.99963668e+00,
        8.99954704e+00,   9.99945739e+00])
In [18]: ecdf = np.hstack([0.0, hist.cumsum() / float(hist.sum())])
In [19]: ecdf
Out[19]:
array([ 0.   ,  0.1  ,  0.201,  0.305,  0.413,  0.493,  0.604,  0.7  ,
       0.788,  0.896,  1.   ])
In [20]: np.interp(np.linspace(0, 10), edges, ecdf)
Out[20]:
array([ 0.        ,  0.0203746 ,  0.04078459,  0.06119459,  0.08160458,
       0.10203472,  0.12264881,  0.14326291,  0.163877  ,  0.18449109,
       0.20522712,  0.22645351,  0.24767991,  0.2689063 ,  0.29013269,
       0.31160366,  0.33364646,  0.35568925,  0.37773204,  0.39977483,
       0.41953158,  0.43585957,  0.45218756,  0.46851556,  0.48484355,
       0.50433802,  0.52699311,  0.54964821,  0.5723033 ,  0.59495839,
       0.61577382,  0.63536742,  0.65496101,  0.6745546 ,  0.6941482 ,
       0.71259664,  0.73055743,  0.74851823,  0.76647902,  0.78443982,
       0.80567348,  0.82771627,  0.84975906,  0.87180185,  0.89384465,
       0.91515087,  0.93637726,  0.95760365,  0.97883004,  1.        ])
In [21]: np.interp(np.linspace(0, 1), ecdf, edges)
Out[21]:
array([  3.53879571e-04,   2.04417216e-01,   4.08480553e-01,
        6.12543890e-01,   8.16607227e-01,   1.02046852e+00,
        1.22251143e+00,   1.42455434e+00,   1.62659724e+00,
        1.82864015e+00,   2.02980301e+00,   2.22601775e+00,
        2.42223250e+00,   2.61844725e+00,   2.81466200e+00,
        3.01047705e+00,   3.19942458e+00,   3.38837211e+00,
        3.57731965e+00,   3.76626718e+00,   3.95521472e+00,
        4.19462069e+00,   4.44969986e+00,   4.70477903e+00,
        4.95985820e+00,   5.15488346e+00,   5.33872431e+00,
        5.52256515e+00,   5.70640600e+00,   5.89024684e+00,
        6.08569264e+00,   6.29825861e+00,   6.51082459e+00,
        6.72339057e+00,   6.93595654e+00,   7.16204944e+00,
        7.39393960e+00,   7.62582975e+00,   7.85771991e+00,
        8.07294833e+00,   8.26189586e+00,   8.45084340e+00,
        8.63979093e+00,   8.82873846e+00,   9.01838365e+00,
        9.21459840e+00,   9.41081315e+00,   9.60702789e+00,
        9.80324264e+00,   9.99945739e+00])
This, together with the answer last week to a piecewise constant pdf,
would make most of the elements for a distribution class, similar to
the generic discrete distribution only for the continuous case.

It could take as constructor either a data array (and use histogram as
in this example, with theoretically optimal bin size) or a set of
points and pdf values as in last weeks example.

Does quad work correctly for integration of a piecewise function?
If yes, then we could just use the generic methods by subclassing rv_continuous.

Is it worth collecting these results?

Josef
...
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco
_______________________________________________
SciPy-User mailing list
SciPy-User@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user

Re: [SciPy-User] Scipy and statistics: probability density function

josef.pktd＠gmail.com