[Numpy-discussion] how to fit a given pdf

Wed Aug 11 20:00:54 EDT 2010

2010/8/12 Renato Fabbri <renato.fabbri at gmail.com>:
> Dear All,
>
> help appreciated, thanks in advance.
>
> how do you fit a pdf you have with a given pdf (say gamma).
>
> with the file attached, you can go like:
>
> a=open("AC-010_ED-1m37F100P0.txt","r")
> aa=a.read()
> a.close()
> aaa=aa[1:-1].split(",")
> data=[int(i) for i in aaa]
>
> if you do pylab.plot(data); pylab.show()
>
> The data is something like:
> ___|\___
>
> It is my pdf (probability density function).
>
> how can i find the right parameters to make that fit with a gamma?
>
> if i was looking for a normal pdf, for example, i would just find mean
> and std and ask for the pdf.
>
> i've been playing with scipy.stats.distributions.gamma but i have not
> reached anything.
>
> we can extend the discussion further, but this is a good starting point.
>
> any idea?

A general point on fitting empirical probability density functions is
that it is often much easier to fit the cumulative distribution
function instead.  For one thing, you don't have to choose the bin
intervals of a histogram.  For another, the cdf is often the quantity
that bears more directly on the final answer (though I don't know
your application, of course).

Here's a quote.

 `So far the discussion of plots of distributions has emphasized
frequency (or probability) vs. size plots, whereas for many
applications cumulative plots are more important. Cumulative curves
are produced by plotting the percentage of particles (or weight,
volume, or surface) having particle diameters greater than (or less
than) a given particle size against the particle size. … Such curves
have the advantage over histograms for plotting data that the class
interval is eliminated, and they can be used to represent data which
are obtained in classified form having unequal class intervals'
(Cadle, R. D. 1965. Particle Size. New York: Reinhold Publishing
Corporation, pp. 38-39)
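
If your data are already binned counts, a cumulative curve of the kind
Cadle describes can be sketched as below.  The function name and the
example counts and bin edges are made up for illustration; only numpy
is assumed.

```python
import numpy as np

def cdf_from_counts(counts, edges):
    """Empirical cdf evaluated at the right-hand edge of each bin.

    counts : number of observations in each bin
    edges  : bin edges, len(counts) + 1 values; widths may be unequal
    """
    counts = np.asarray(counts, dtype=float)
    edges = np.asarray(edges, dtype=float)
    if edges.size != counts.size + 1:
        raise ValueError("need exactly one more edge than counts")
    # Running total of the counts, normalised so the last value is 1.
    cdf = np.cumsum(counts) / counts.sum()
    return edges[1:], cdf

# Made-up counts in bins of unequal width.
x, F = cdf_from_counts([2, 5, 2, 1], [0.0, 1.0, 1.5, 3.0, 6.0])
```

The bin widths never enter the calculation, which is the sense in
which the class interval is eliminated.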

Once you've got your empirical cdf, the problem reduces to one of
nonlinear curve fitting, for whichever theoretical distribution you
like.  For a tutorial on nonlinear curve fitting, see
scipy.optimize.leastsq at
http://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html.  You
could of course use this approach for the pdf too, but I fancy the cdf
result will be more robust.
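
As a rough sketch of the cdf fit, assuming you have raw samples rather
than a histogram: the helper names, the plotting positions (i - 0.5)/n,
the log-parameter trick to keep shape and scale positive, and the
synthetic test data are all my own choices, not anything prescribed by
scipy.

```python
import numpy as np
from scipy import optimize, stats

def empirical_cdf(samples):
    """Sorted samples and their empirical cdf values (no binning)."""
    x = np.sort(np.asarray(samples, dtype=float))
    F = (np.arange(1, x.size + 1) - 0.5) / x.size
    return x, F

def fit_gamma_cdf(x, F, start):
    """Least-squares fit of a gamma cdf (loc fixed at 0) to (x, F)."""
    def residuals(log_params):
        # Work with logs so the optimiser cannot go negative.
        shape, scale = np.exp(log_params)
        return stats.gamma.cdf(x, shape, loc=0, scale=scale) - F
    log_fit, ier = optimize.leastsq(residuals, np.log(start))
    shape, scale = np.exp(log_fit)
    return shape, scale

# Synthetic data standing in for real measurements.
rng = np.random.RandomState(0)
samples = rng.gamma(2.0, 3.0, size=2000)

x, F = empirical_cdf(samples)
# Starting guess: an exponential (shape 1) with the sample mean.
shape, scale = fit_gamma_cdf(x, F, start=(1.0, samples.mean()))
```

With real data you would want to check the returned ier flag and plot
the fitted cdf over the empirical one before trusting the numbers.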

On the other hand, if you want something like your `mean and variance'
approach to fitting normal distributions, you could still compare your
mean and variance with the known values for the Gamma distribution
(mean k*theta and variance k*theta^2 for shape k and scale theta; see
e.g. its Wikipedia page) and back out the two parameters of the
distribution from them.  I'm not too sure how well this will work,
but it's pretty easy.
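
For the record, the Gamma distribution with shape k and scale theta
has mean k*theta and variance k*theta^2, so backing out the parameters
is two lines.  The function names here are hypothetical, and the
sample-based version assumes raw samples; for a tabulated pdf you
would compute the mean and variance as weighted sums instead.

```python
import numpy as np

def gamma_from_moments(mean, var):
    """Shape and scale of the gamma with the given mean and variance."""
    shape = mean ** 2 / var   # k = mean^2 / variance
    scale = var / mean        # theta = variance / mean
    return shape, scale

def gamma_from_samples(samples):
    """Same thing, taking the moments from raw samples."""
    samples = np.asarray(samples, dtype=float)
    return gamma_from_moments(samples.mean(), samples.var())

# mean 6 and variance 18 correspond to shape 2 and scale 3
shape, scale = gamma_from_moments(6.0, 18.0)
```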

Another idea, about as easy, is to compute the two parameters of the
Gamma distribution by collocation with the empirical cdf; i.e. pick
two quantiles, e.g. 0.25 and 0.75, or whatever, and get two equations
for the two unknown parameters by insisting that the Gamma cdf agree
with the empirical cdf at these quantiles.  This might be more robust
than the mean & variance approach, but I haven't tried either.
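
A sketch of the quantile idea, with one simplification of my own:
because the scale only stretches the x axis, the ratio of the two
quantiles depends on the shape alone, so the two equations reduce to a
one-dimensional root find for the shape followed by a division for the
scale.  The function name and the bracket passed to brentq are
assumptions.

```python
from scipy import optimize, stats

def gamma_from_quartiles(q25, q75):
    """Shape and scale of the gamma whose cdf is 0.25 at q25, 0.75 at q75."""
    target = q75 / q25

    def mismatch(shape):
        # Quartiles of the unit-scale gamma with this shape.
        lo, hi = stats.gamma.ppf([0.25, 0.75], shape)
        return hi / lo - target

    # Assumed bracket; brentq raises ValueError if the quartile ratio
    # is too close to 1 or too large for a shape in this range.
    shape = optimize.brentq(mismatch, 0.05, 1e4)
    scale = q25 / stats.gamma.ppf(0.25, shape)
    return shape, scale

# Check against exact quartiles of a gamma with shape 2 and scale 3.
# For real data, q25 and q75 could come from
# numpy.percentile(samples, [25, 75]).
q25, q75 = stats.gamma.ppf([0.25, 0.75], 2.0, scale=3.0)
shape, scale = gamma_from_quartiles(q25, q75)
```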

Good luck!