[SciPy-user] Fitting an arbitrary distribution

Thu May 21 23:47:21 EDT 2009

Thanks for the prompt replies!

I guess what I meant was that the PDF / histogram is the sum of multiple Gaussian (normal) distributions. Sorry about the ambiguity. I've had a quick look at the Em package and mixture models, and while my problem is similar, they might be a little more general.

I guess I should describe the problem in a bit more detail - I'm measuring the lengths of objects which can be built up from multiple unit cells. The measured size distribution is thus multimodal, and I want to extract both the unit size and the fraction of objects having each number of unit cells. This makes the problem much more constrained than what is dealt with in the Em package.

So far I've tried overriding rv_continuous to create a distribution which roughly matches - but haven't been able to fit this.
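
A minimal sketch of one way to set up this constrained fit, as an unbinned maximum-likelihood problem rather than through rv_continuous. It assumes things the description above does not state: that every peak has the same width, that the largest number of unit cells (n_max) is known, and that the k-th peak sits exactly at k times the unit size. The names neg_log_likelihood and fit_unit_mixture are made up for illustration.

```python
import numpy as np
from scipy import optimize, stats
from scipy.special import logsumexp


def neg_log_likelihood(params, x, n_max):
    """Negative log-likelihood of a mixture whose k-th mean is k * unit."""
    unit, log_sigma, logits = params[0], params[1], params[2:]
    log_w = logits - logsumexp(logits)        # log of weights that sum to 1
    k = np.arange(1, n_max + 1)
    # Log-density of every point under every component: shape (len(x), n_max)
    log_pdf = stats.norm.logpdf(x[:, None], loc=k * unit, scale=np.exp(log_sigma))
    return -logsumexp(log_pdf + log_w, axis=1).sum()


def fit_unit_mixture(x, n_max, unit_guess, sigma_guess):
    """Return (unit, sigma, weights) maximizing the unbinned likelihood."""
    x = np.asarray(x, dtype=float)
    start = np.concatenate([[unit_guess, np.log(sigma_guess)], np.zeros(n_max)])
    res = optimize.minimize(neg_log_likelihood, start, args=(x, n_max),
                            method="L-BFGS-B")
    logits = res.x[2:]
    return res.x[0], np.exp(res.x[1]), np.exp(logits - logsumexp(logits))


# Synthetic check (assumed values): unit 2.0, width 0.2, fractions 0.5/0.3/0.2
rng = np.random.default_rng(0)
n_cells = rng.choice([1, 2, 3], size=3000, p=[0.5, 0.3, 0.2])
lengths = rng.normal(2.0 * n_cells, 0.2)
unit_hat, sigma_hat, weights_hat = fit_unit_mixture(
    lengths, 3, unit_guess=1.9, sigma_guess=0.3)
```

Only the unit size, one width and the fractions are free, which is what makes this more constrained than a general mixture fit. The result depends on a reasonable starting guess for the unit size.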

cheers,
David

----- Original Message ----
From: "scipy-user-request at scipy.org" <scipy-user-request at scipy.org>
To: scipy-user at scipy.org
Sent: Friday, 22 May, 2009 2:44:37 PM
Subject: SciPy-user Digest, Vol 69, Issue 43

Today's Topics:

   1. Fitting an arbitrary distribution (David Baddeley)
   2. Re: Fitting an arbitrary distribution (David Cournapeau)
   3. Re: Inconsistent function calls? (Ivo Maljevic)
   4. Re: Fitting an arbitrary distribution (josef.pktd at gmail.com)
   5. Re: Fitting an arbitrary distribution (josef.pktd at gmail.com)
   6. Re: Fitting an arbitrary distribution (David Cournapeau)
   7. Re: Fitting an arbitrary distribution (josef.pktd at gmail.com)
   8. Re: Fitting an arbitrary distribution (josef.pktd at gmail.com)

----------------------------------------------------------------------

Message: 1
Date: Thu, 21 May 2009 18:47:00 -0700 (PDT)
From: David Baddeley <david_baddeley at yahoo.com.au>
Subject: [SciPy-user] Fitting an arbitrary distribution
To: scipy-user at scipy.org
Message-ID: <36002.66689.qm at web33005.mail.mud.yahoo.com>
Content-Type: text/plain; charset=utf-8

Hi all,

I want to fit an arbitrary distribution (in this case the sum of multiple Gaussians) to some measured data and was wondering if anyone could give me any pointers as to the best way of doing this. I'd like to avoid fitting to a histogram if possible. How do the .fit() methods of the various distributions under scipy.stats do it? My first thought would be to compare the cumulative distribution of my data with that of the model distribution using something like the Kolmogorov-Smirnov metric (maximum absolute distance between the curves) and to minimize this using optimize.fmin. Is this the right way to do it? Or is there an easier way?
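
The idea in the previous paragraph can be sketched as follows. This is only an illustration under assumptions: the model is taken to be a two-component Gaussian mixture with a shared width, the data are synthetic, and the helper names mixture_cdf and ks_distance are made up here.

```python
import numpy as np
from scipy import optimize, stats


def mixture_cdf(x, params):
    """CDF of a two-component Gaussian mixture with a shared width."""
    w, mu1, mu2, sigma = params
    return w * stats.norm.cdf(x, mu1, sigma) + (1 - w) * stats.norm.cdf(x, mu2, sigma)


def ks_distance(params, x_sorted):
    """Maximum absolute distance between the empirical and the model CDF."""
    w, _, _, sigma = params
    if not (0.0 <= w <= 1.0) or sigma <= 0.0:
        return 1.0                      # invalid parameters: worst possible distance
    n = len(x_sorted)
    cdf = mixture_cdf(x_sorted, params)
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)   # ECDF just after each point
    d_minus = np.max(cdf - np.arange(0, n) / n)      # ECDF just before each point
    return max(d_plus, d_minus)


# Synthetic data (assumed values), sorted once so the ECDF is cheap to evaluate
rng = np.random.default_rng(1)
data = np.sort(np.concatenate([rng.normal(1.0, 0.5, 800),
                               rng.normal(3.0, 0.5, 1200)]))
start = np.array([0.5, 0.8, 3.2, 0.6])               # w, mu1, mu2, sigma
fitted = optimize.fmin(ks_distance, start, args=(data,), disp=False)
d_start, d_fit = ks_distance(start, data), ks_distance(fitted, data)
```

The objective is not smooth, so a derivative-free method such as fmin (Nelder-Mead) is a natural choice, but it can stall in a local minimum and is sensitive to the starting point.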

thanks in advance,
David

------------------------------

Message: 2
Date: Fri, 22 May 2009 10:58:06 +0900
From: David Cournapeau <david at ar.media.kyoto-u.ac.jp>
Subject: Re: [SciPy-user] Fitting an arbitrary distribution
To: David Baddeley <david_baddeley at yahoo.com.au>,    SciPy Users List
    <scipy-user at scipy.org>
Message-ID: <4A1606AE.1030008 at ar.media.kyoto-u.ac.jp>
Content-Type: text/plain; charset=ISO-8859-1

David Baddeley wrote:
> Hi all,
>
> I want to fit an arbitrary distribution (in this case the sum of multiple Gaussians) to some measured data and was wondering if anyone could give me any pointers as to the best way of doing this. I'd like to avoid fitting to a histogram if possible. How do the .fit() methods of the various distributions under scipy.stats do it? My first thought would be to compare the cumulative distribution of my data with that of the model distribution using something like the Kolmogorov-Smirnov metric (maximum absolute distance between the curves) and to minimize this using optimize.fmin. Is this the right way to do it? Or is there an easier way?

That's a complex topic in general, there is no best answer, it depends
on your case, and what you intend to do with the estimated distribution.

In the case of a sum of multiple Gaussians, the more commonly used name
for this model is a mixture model, and there is a vast range of possible
techniques for fitting such a model to a dataset. There is a package in
scikits.learn that uses the so-called Expectation Maximization algorithm
to compute maximum likelihood estimates for such models:

http://www.ar.media.kyoto-u.ac.jp/members/david/softwares/em/

You can have an overview on the wiki page:

http://en.wikipedia.org/wiki/Mixture_model
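
For orientation, the Expectation Maximization iteration for a one-dimensional Gaussian mixture can be sketched in plain NumPy. This is not the API of the package linked above; the function name em_gaussian_mixture, the fixed iteration count and the synthetic data are assumptions made for the example.

```python
import numpy as np
from scipy import stats


def em_gaussian_mixture(x, means, sigmas, weights, n_iter=200):
    """Run a fixed number of EM iterations for a 1-D Gaussian mixture."""
    x = np.asarray(x, dtype=float)
    means = np.array(means, dtype=float)
    sigmas = np.array(sigmas, dtype=float)
    weights = np.array(weights, dtype=float)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point, shape (n, k)
        dens = weights * stats.norm.pdf(x[:, None], means, sigmas)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and widths from the responsibilities
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        sigmas = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / nk)
    return means, sigmas, weights


# Synthetic data (assumed values): 40% around 1.0 and 60% around 4.0
rng = np.random.default_rng(2)
sample = np.concatenate([rng.normal(1.0, 0.5, 400), rng.normal(4.0, 0.5, 600)])
em_means, em_sigmas, em_weights = em_gaussian_mixture(
    sample, means=[0.0, 5.0], sigmas=[1.0, 1.0], weights=[0.5, 0.5])
```

A practical implementation would stop on a convergence criterion and guard against components that collapse onto a single point; both are left out here for brevity.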

cheers,

David

------------------------------

Message: 3
Date: Thu, 21 May 2009 22:17:42 -0400
From: Ivo Maljevic <ivo.maljevic at gmail.com>
Subject: Re: [SciPy-user] Inconsistent function calls?
To: SciPy Users List <scipy-user at scipy.org>
Message-ID:
    <826c64da0905211917u15ec1567g72547e6cff117535 at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Sorry Christopher, I thought since they are used for the same purpose, and
have similar syntax (http://www.scipy.org/NumPy_for_Matlab_Users says
``MATLAB® and NumPy/SciPy have a lot in common``), that SciPy looks more
like Matlab than any other programming language (excluding Octave and other
Matlab clones).

As for everything else you wrote, I already said that I don't have any
problem with using SciPy the way it is.

Ivo

2009/5/21 Christopher Barker <Chris.Barker at noaa.gov>

> Ivo Maljevic wrote:
> > why bother to make something that looks like matlab,
>
> who ever said numpy "looks like matlab", any more than it looks like
> any number of other programming environments...
>
> > Matplotlib does a pretty good job at  replicating
> > matlab plot functions, at least at the level I need it to.
>
> Because it was designed exactly to do that -- but I think MPL's Matlab
> replicating has been a hindrance, rather than a help, to a good API.
> However, it has been a help to its adoption.
>
> You may have noticed that over the years MPL is moving away from matlab,
> toward a more pythonic API.
>
> Personally, I like python so much more than Matlab exactly for these
> differences (and so many more). I suppose it's tough if you switch back
> and forth, but I haven't touched Matlab in years.
>
> It is rand() that is inconsistent, and that is an accident of history.
>
> > what ones([3,3]) does, the same way random.rand(3,3) does,
>
> well, rand() is a convenience function, and doesn't take a bunch of
> other parameters.  In fact, it's listed under "Compatibility functions",
> and is really a wrapper for:
>
> numpy.random.uniform, which takes a shape argument.
>
> > the reason why I included that error message in my previous message
> > is because I think it is completely non-helpful.
>
> That's another issue -- non-helpful error messages do show up a lot --
> in that case, if the user had typed:
>
> np.zeros(3, dtype=3)
>
> the error message would make sense. If you can suggest a better message,
> patches are always welcome.
>
> -Chris
>
>
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov

------------------------------

Message: 4
Date: Thu, 21 May 2009 22:27:20 -0400
From: josef.pktd at gmail.com
Subject: Re: [SciPy-user] Fitting an arbitrary distribution
To: David Baddeley <david_baddeley at yahoo.com.au>,    SciPy Users List
    <scipy-user at scipy.org>
Message-ID:
    <1cd32cbb0905211927l2ec6e3fbs1a5922b21bc966bd at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On Thu, May 21, 2009 at 9:47 PM, David Baddeley
<david_baddeley at yahoo.com.au> wrote:
>
> Hi all,
>
> I want to fit an arbitrary distribution (in this case the sum of multiple Gaussians) to some measured data and was wondering if anyone could give me any pointers as to the best way of doing this. I'd like to avoid fitting to a histogram if possible. How do the .fit() methods of the various distributions under scipy.stats do it? My first thought would be to compare the cumulative distribution of my data with that of the model distribution using something like the Kolmogorov-Smirnov metric (maximum absolute distance between the curves) and to minimize this using optimize.fmin. Is this the right way to do it? Or is there an easier way?
>

I have an example script that tries to fit a dataset to all
distributions in scipy.stats

http://code.google.com/p/joepython/source/browse/trunk/joepython/scipystats/enhance/try_VaR.py

I use ksstat as distance metric.
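
The general pattern (fit each candidate, then rank by the Kolmogorov-Smirnov statistic) might look like the sketch below. It is not the linked script; the helper name rank_by_ks, the short candidate list and the synthetic data are assumptions for illustration.

```python
import numpy as np
from scipy import stats


def rank_by_ks(data, candidates):
    """Fit each distribution by its .fit() method and sort by KS statistic."""
    results = []
    for dist in candidates:
        params = dist.fit(data)                       # maximum-likelihood fit
        ks = stats.kstest(data, dist.cdf, args=params).statistic
        results.append((ks, dist.name, params))
    return sorted(results, key=lambda r: r[0])        # smallest distance first


# Synthetic data (assumed values) and a few full-support candidates
rng = np.random.default_rng(3)
observations = rng.normal(10.0, 2.0, 1000)
ranking = rank_by_ks(observations,
                     [stats.norm, stats.logistic, stats.laplace, stats.gumbel_r])
for ks, name, params in ranking:
    print(f"{name:10s} KS = {ks:.4f}")
```

Because the parameters are estimated from the same data, the p-values that kstest reports are not valid here; the statistic is only used as a distance for ranking.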

If you have data with full support on the real line and look only at
distributions with that support, then the current fit method works pretty well.
Problems exist for distributions whose support has a finite boundary point.
Also, stats.distributions only has univariate distributions; there is no
support for multivariate distributions.
I have also written several extension distributions (also univariate
only), which are, however, not yet in scipy.

What exactly do you mean by "sum of multiple Gaussians"? If I take
it literally as the sum of several (independent) normally distributed random
variables, then the distribution would be just normal again.
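
Written out, under the assumption that the variables are independent, the literal reading gives:

```latex
X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)\ \text{independent}
\quad\Longrightarrow\quad
\sum_{i=1}^{n} X_i \sim \mathcal{N}\!\Bigl(\sum_{i=1}^{n} \mu_i,\ \sum_{i=1}^{n} \sigma_i^2\Bigr)
```

So the sum has a single peak, and nothing about the individual means can be recovered beyond their total.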

If you provide some more information on the structure of your data, I
would be better able to see if scipy.stats can handle them.

Josef

------------------------------

Message: 5
Date: Thu, 21 May 2009 22:33:12 -0400
From: josef.pktd at gmail.com
Subject: Re: [SciPy-user] Fitting an arbitrary distribution
To: SciPy Users List <scipy-user at scipy.org>
Message-ID:
    <1cd32cbb0905211933x53ea5b88na2f64934c5f121c at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On Thu, May 21, 2009 at 9:58 PM, David Cournapeau
<david at ar.media.kyoto-u.ac.jp> wrote:
> That's a complex topic in general, there is no best answer, it depends
> on your case, and what you intend to do with the estimated distribution.
>
> In the case of a sum of multiple Gaussians, the more commonly used name
> for this model is mixture models, and there is a vast range of possible
> techniques for fitting a dataset to this model. There is a package in
> scikits.learn to use the so-called Expectation Maximization algorithm to
> estimate the maximum likelihood of such models
>
> http://www.ar.media.kyoto-u.ac.jp/members/david/softwares/em/
>
> You can have an overview on the wiki page:
>
> http://en.wikipedia.org/wiki/Mixture_model
>

Sums of random variables are convolutions, and are very different from
mixtures of distributions. I just got confused in a discussion today
when the other person talked about convolutions and I thought about
mixtures and it didn't make a lot of sense.

so, which is it?

Josef

------------------------------

Message: 6
Date: Fri, 22 May 2009 11:23:00 +0900
From: David Cournapeau <david at ar.media.kyoto-u.ac.jp>
Subject: Re: [SciPy-user] Fitting an arbitrary distribution
To: SciPy Users List <scipy-user at scipy.org>
Message-ID: <4A160C84.9050500 at ar.media.kyoto-u.ac.jp>
Content-Type: text/plain; charset=ISO-8859-1

josef.pktd at gmail.com wrote:
> Sums of random variables are convolutions, and are very different from
> mixtures of distributions. I just got confused in a discussion today
> when the other person talked about convolutions and I thought about
> mixtures and it didn't make a lot of sense.
>  

It depends on what is meant by sum of Gaussians: sum of the random
variables or sum of the distributions. In the case of the sum of random
variables, then it is a convolution as you mentioned (assuming
independence of the random variables). But I think some people think
mostly in terms of histograms/distributions, especially if they are not
statisticians. I don't understand the term "sum of gaussians" as a
technical term.

David

------------------------------

Message: 7
Date: Thu, 21 May 2009 22:41:37 -0400
From: josef.pktd at gmail.com
Subject: Re: [SciPy-user] Fitting an arbitrary distribution
To: SciPy Users List <scipy-user at scipy.org>
Message-ID:
    <1cd32cbb0905211941l5b7f6611g84aaedcd57150b9e at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On Thu, May 21, 2009 at 10:33 PM,  <josef.pktd at gmail.com> wrote:
> Sums of random variables are convolutions, and are very different from
> mixtures of distributions. I just got confused in a discussion today
> when the other person talked about convolutions and I thought about
> mixtures and it didn't make a lot of sense.
>
> so, which is it?
>

Actually, "Gaussians" is ambiguous in this context: does it mean a
random variable, or does it refer to the density/distribution function?
A sum of random variables is very different from a (weighted) sum of
distribution functions, and both are possible interpretations of "sum
of Gaussians".
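
The difference between the two readings is easy to see numerically. The sketch below uses assumed values (means 0 and 4, unit widths, equal weights) and draws both quantities from the same pair of independent normal samples.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
a = rng.normal(0.0, 1.0, n)            # first Gaussian random variable
b = rng.normal(4.0, 1.0, n)            # second, independent of the first

# Reading 1: sum of the random variables (density is a convolution).
# The result is again normal, with mean 0 + 4 and variance 1 + 1.
total = a + b

# Reading 2: weighted sum of the densities (a mixture).
# Each observation comes from one component, chosen with probability 0.5.
pick_first = rng.random(n) < 0.5
mixture = np.where(pick_first, a, b)

print("sum:     mean %.2f, variance %.2f" % (total.mean(), total.var()))
print("mixture: mean %.2f, variance %.2f" % (mixture.mean(), mixture.var()))
```

For these values the sum is unimodal around 4 with variance 2, while the mixture is bimodal with mean 2 and variance 5, since its variance also includes the spread between the two component means.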

Josef

------------------------------

Message: 8
Date: Thu, 21 May 2009 22:44:30 -0400
From: josef.pktd at gmail.com
Subject: Re: [SciPy-user] Fitting an arbitrary distribution
To: SciPy Users List <scipy-user at scipy.org>
Message-ID:
    <1cd32cbb0905211944x5e65ad10r75a5b43c676a290b at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On Thu, May 21, 2009 at 10:23 PM, David Cournapeau
<david at ar.media.kyoto-u.ac.jp> wrote:
> It depends on what is meant by sum of Gaussians: sum of the random
> variables or sum of the distributions. In the case of the sum of random
> variables, then it is a convolution as you mentioned (assuming
> independence of the random variables). But I think some people think
> mostly in terms of histograms/distributions, especially if they are not
> statisticians. I don't understand the term "sum of gaussians" as a
> technical term.
>

Yes, I agree, you were ahead of me on realizing this.

Josef
