[SciPy-user] Fitting an arbitrary distribution

josef.pktd at gmail.com josef.pktd at gmail.com
Fri May 22 00:24:59 EDT 2009


On Thu, May 21, 2009 at 11:47 PM, David Baddeley
<david_baddeley at yahoo.com.au> wrote:
>
> Thanks for the prompt replies!
>
> I guess what I meant was that the PDF / histogram is the sum of multiple Gaussian/normal distributions. Sorry about the ambiguity. I've had a quick look at the Em package and mixture models, and while my problem is similar they might be a little more general.
>
> I guess I should describe the problem in a bit more detail - I'm measuring the length of objects which can be built up from multiple unit cells. The measured size distribution is thus multimodal, and I want to extract both the unit size and the fraction of objects having each number of unit cells. This makes the problem much more constrained than what is dealt with in the Em package.
>
> So far I've tried overriding rv_continuous to create a distribution which roughly matches - but haven't been able to fit this.

First, please don't leave the entire digest in your reply.

Just for clarification: do all unit cells have the same size
distribution? Because in that case you have a lot more structure in
your distribution than is generally assumed in mixture models. Also,
the number of parameters to estimate would be much smaller. So maximum
likelihood might work relatively well if you give it good starting
values based on information you get from the histogram.

If the size distribution of a unit cell is additionally normally
distributed, then it would be possible to write the correct likelihood
function and use fit to estimate the distribution parameters.

f(x) = sum_n f(x|n,theta) p(n)

where f(x|n,theta) would be the normal distribution of the sum of n iid
random variables (the cell sizes), i.e. normal with mean n*mu and
variance n*sigma^2, and theta = (mu, sigma^2) would be the common mean
and variance of a single cell's normal distribution.
p(n) could be non-parametric, a vector of p's, or parametric, e.g. Poisson.

The distribution parameters would be just mean, variance and the p's
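
A minimal sketch of this likelihood, fitted directly with
scipy.optimize rather than through rv_continuous. Everything named
here is an assumption for illustration: the cap N_MAX on the number of
cells, the helper names unpack and neg_loglike, the synthetic data, and
the starting values (which in practice would be read off the
histogram). The p's are parametrized through a softmax so that they
stay positive and sum to one.

```python
import numpy as np
from scipy import optimize, stats
from scipy.special import logsumexp

N_MAX = 3  # assumed largest number of unit cells per object


def unpack(params):
    """Map unconstrained parameters to (mu, sigma, p)."""
    mu = params[0]
    sigma = np.exp(params[1])                    # keeps sigma positive
    logits = np.concatenate(([0.0], params[2:]))
    p = np.exp(logits - logsumexp(logits))       # softmax, sums to 1
    return mu, sigma, p


def neg_loglike(params, x):
    """Negative log-likelihood of f(x) = sum_n f(x|n,theta) p(n)."""
    mu, sigma, p = unpack(params)
    n = np.arange(1, N_MAX + 1)
    # The sum of n iid N(mu, sigma^2) cells is N(n*mu, n*sigma^2).
    logf = stats.norm.logpdf(x[:, None], loc=n * mu, scale=np.sqrt(n) * sigma)
    return -logsumexp(logf + np.log(p), axis=1).sum()


# Synthetic data standing in for the measured lengths.
rng = np.random.default_rng(0)
true_mu, true_sigma = 10.0, 1.0
true_p = np.array([0.5, 0.3, 0.2])
counts = rng.choice(np.arange(1, N_MAX + 1), size=2000, p=true_p)
x = rng.normal(counts * true_mu, np.sqrt(counts) * true_sigma)

# Start: mu, log(sigma), and N_MAX - 1 logits for the p's.
start = np.array([9.0, 0.0, 0.0, 0.0])
res = optimize.minimize(neg_loglike, start, args=(x,),
                        method="Nelder-Mead", options={"maxiter": 2000})
mu_hat, sigma_hat, p_hat = unpack(res.x)
print(mu_hat, sigma_hat, p_hat)
```

With only N_MAX + 1 free parameters the optimizer has little room to
wander, but a poor starting value for mu can still land in a local
optimum, so it is worth trying a few starts.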

This should be doable by subclassing rv_continuous; I'll try to look
for an example of subclassing. However, this still wouldn't give you
test statistics, information criteria such as AIC, and so on.
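
As a rough illustration of the subclassing idea (not the example
referred to above), here is a sketch restricted to objects made of one
or two unit cells. The class name UnitCellMixture and the shape
parameter p1 (the fraction of single-cell objects) are hypothetical
names chosen for this sketch. It only defines the density and checks
that it integrates to one.

```python
import numpy as np
from scipy import integrate, stats


class UnitCellMixture(stats.rv_continuous):
    """Hypothetical two-component case: objects of 1 or 2 unit cells."""

    def _argcheck(self, mu, sigma, p1):
        # sigma must be positive; p1 is a probability.
        return (sigma > 0) & (p1 >= 0) & (p1 <= 1)

    def _pdf(self, x, mu, sigma, p1):
        # One cell: N(mu, sigma^2); two cells: N(2*mu, 2*sigma^2).
        one = stats.norm.pdf(x, loc=mu, scale=sigma)
        two = stats.norm.pdf(x, loc=2 * mu, scale=np.sqrt(2) * sigma)
        return p1 * one + (1 - p1) * two


unit_cell_mixture = UnitCellMixture(name="unit_cell_mixture",
                                    shapes="mu, sigma, p1")

# Density at the two modes for mu=10, sigma=1, p1=0.6.
density = unit_cell_mixture.pdf(np.array([10.0, 20.0]), 10.0, 1.0, 0.6)

# Sanity check: the density should integrate to (nearly) one.
area, _ = integrate.quad(unit_cell_mixture.pdf, 0.0, 40.0,
                         args=(10.0, 1.0, 0.6), points=[10.0, 20.0])
print(density, area)
```

In principle the inherited fit method could then be called with
starting values for mu, sigma and p1, keeping loc and scale fixed via
floc=0 and fscale=1; how well the generic optimizer behaves for this
density is not something this sketch tests.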

If you don't want to impose so much structure, you might be better off
with the general packages for mixtures, e.g. David's EM, or I think
pymc should also be able to handle this, although I have never tried it.

Josef


