How to fit parameters of beta distribution?

Hi,
I can see an instancemethod scipy.stats.beta.fit, but I can't work out from the docs how to use it. From trial and error I got the following:
    In [12]: scipy.stats.beta.fit([.5])
    Out[12]: array([ 1.87795851e+00, 1.81444871e-01, 2.39026963e-04, 4.99760973e-01])
What are the 4 values output by the method?
Thanks,
John.

Hi John,
the short answer is (a, b, loc, scale), but you probably want to fix loc=0 and scale=1 to get meaningful a, b estimates.
It takes some time to learn how scipy.stats.rv_continuous works, but this is a good starting point: http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html#distributions
There you'll see that every rv_continuous distribution (e.g. norm, chi2, beta) has two parameters, loc and scale, which shift and stretch the distribution like this: (x - loc) / scale
E.g. from the docstring of scipy.stats.norm, you can see that norm uses these two parameters and has no extra "shape parameters":
    Normal distribution
    The location (loc) keyword specifies the mean.
    The scale (scale) keyword specifies the standard deviation.
    normal.pdf(x) = exp(-x**2/2)/sqrt(2*pi)
You can draw a random data sample and fit it like this:
    data = scipy.stats.norm.rvs(loc=10, scale=2, size=100)
    scipy.stats.norm.fit(data)  # returns loc, scale
    # (9.9734277669649689, 2.2125503785545551)
The beta distribution you are interested in has two shape parameters, a and b, in addition to the loc and scale parameters every rv_continuous has:
    Beta distribution
    beta.pdf(x, a, b) = gamma(a+b)/(gamma(a)*gamma(b)) * x**(a-1) * (1-x)**(b-1)
    for 0 < x < 1, a, b > 0.
In your case you probably want to fix loc=0 and scale=1 and only fit the a and b parameters, which you can do like this:
    data = scipy.stats.beta.rvs(2, 5, size=100)  # a = 2, b = 5 (can't use keyword arguments)
    scipy.stats.beta.fit(data, floc=0, fscale=1)  # returns a, b, loc, scale
    # (2.6928363303187393, 5.9855671734557454, 0, 1)
I find that the splitting of parameters into "location and scale" and "shape" makes rv_continuous usage complicated:
- it is uncommon for the beta, chi2 and many other distributions to be given loc and scale parameters
- the auto-generated docstrings are confusing at first
But if you look at the implementation, it does avoid some repetitive code for the developers.
Btw., I don't know how you can fit multiple parameters to only one measurement [.5] in your example. You must have executed some code before that line, otherwise you would get a bunch of RuntimeWarnings and a different return value from the one you give (I'm on scipy 0.9):
    In [1]: import scipy.stats
    In [2]: scipy.stats.beta.fit([.5])
    Out[2]: (1.0, 1.0, 0.5, 0.0)
Christoph
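A minimal sketch of the two points made in this reply: how loc and scale enter the density, and how to fit only a and b with loc and scale held fixed. It assumes a SciPy version that supports floc/fscale and the random_state argument; the seed, the sample size and the illustration values loc0 = 3 and scale0 = 2 are arbitrary choices, not taken from the thread.

```python
import numpy as np
from scipy import stats

# loc and scale shift and stretch the standard density:
# pdf(x, a, b, loc, scale) == pdf((x - loc) / scale, a, b) / scale
loc0, scale0 = 3.0, 2.0                  # arbitrary illustration values
x = np.linspace(3.1, 4.9, 5)             # points inside (loc0, loc0 + scale0)
shifted = stats.beta.pdf(x, 2, 5, loc=loc0, scale=scale0)
standard = stats.beta.pdf((x - loc0) / scale0, 2, 5) / scale0

# Draw a sample with a = 2, b = 5, then estimate only the shape parameters
# by fixing loc = 0 and scale = 1.
data = stats.beta.rvs(2, 5, size=1000, random_state=0)
a, b, loc, scale = stats.beta.fit(data, floc=0, fscale=1)
print(a, b, loc, scale)  # loc and scale come back as the fixed values
```

With loc and scale fixed, the first two returned values can be read directly as the usual beta parameters on (0, 1).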

Thanks for the information. Just out of interest, this is what I get on scipy 0.7 (no warnings):
    In [1]: import scipy.stats
    In [2]: scipy.stats.beta.fit([.5])
    Out[2]: array([ 1.87795851e+00, 1.81444871e-01, 2.39026963e-04, 4.99760973e-01])
    In [3]: scipy.__version__
    Out[3]: '0.7.0'
Also I have (following your advice):
    In [7]: scipy.stats.beta.fit([.5], floc=0., fscale=1.)
    Out[7]: array([ 1.87795851e+00, 1.81444871e-01, 2.39026963e-04, 4.99760973e-01])
which just seems wrong: surely the loc and scale in the output should be what I specified in the arguments? In any case, from your example it seems to be fixed in 0.9.
I'm assuming fit() does an ML estimate of the parameters, which I think is fine to do for a beta distribution and one data point.
Thanks,
John.

On Fri, Jun 24, 2011 at 8:37 AM, John Reid <j.reid@mail.cryst.bbk.ac.uk> wrote:
> Thanks for the information. Just out of interest, this is what I get on scipy 0.7 (no warnings)
> In [1]: import scipy.stats
> In [2]: scipy.stats.beta.fit([.5])
> Out[2]: array([ 1.87795851e+00, 1.81444871e-01, 2.39026963e-04, 4.99760973e-01])
> In [3]: scipy.__version__
> Out[3]: '0.7.0'
> Also I have (following your advice):
> In [7]: scipy.stats.beta.fit([.5], floc=0., fscale=1.)
> Out[7]: array([ 1.87795851e+00, 1.81444871e-01, 2.39026963e-04, 4.99760973e-01])
> which just seems wrong, surely the loc and scale in the output should be what I specified in the arguments? In any case from your example, it seems like it is fixed in 0.9
floc and fscale were added in scipy 0.9; extra keywords on 0.7 were just ignored.
> I'm assuming fit() does a ML estimate of the parameters which I think is fine to do for a beta distribution and one data point.
You need at least as many observations as parameters, and without enough observations the estimate will be very noisy. With fewer observations than parameters, you cannot identify the parameters.
Josef
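To make the remark about noisy estimates concrete, here is a small simulation sketch. It uses scipy.stats.norm rather than beta purely because norm.fit is simple and fast; the true parameters (loc=10, scale=2), the sample sizes 3 and 300, the seed, and the helper name scale_spread are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def scale_spread(n, repeats=500):
    """Standard deviation of the fitted scale over many samples of size n."""
    fitted = [
        stats.norm.fit(rng.normal(loc=10, scale=2, size=n))[1]  # [1] is scale
        for _ in range(repeats)
    ]
    return np.std(fitted)

spread_small = scale_spread(3)     # barely more observations than parameters
spread_large = scale_spread(300)   # many more observations than parameters
print(spread_small, spread_large)
```

The fitted scale scatters far more widely across repeated samples of size 3 than across samples of size 300, which is the sense in which a fit on very few observations is noisy.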

On 24/06/11 13:58, josef.pktd@gmail.com wrote:
> floc and fscale were added in scipy 0.9; extra keywords on 0.7 were just ignored.
OK
>> I'm assuming fit() does a ML estimate of the parameters which I think is fine to do for a beta distribution and one data point.
> You need at least as many observations as parameters, and without enough observations the estimate will be very noisy. With fewer observations than parameters, you cannot identify the parameters.
I'm not quite sure what you mean by "identify". It is an ML estimate, isn't it? That seems legitimate here, but it wasn't really my original question; I was just using [.5] as an example.
Thanks,
John.

On Jun 24, 2011, at 3:09 PM, John Reid wrote:
> I'm not quite sure what you mean by "identify". It is a ML estimate isn't it? That seems legitimate here but it wasn't really my original question. I was just using [.5] as an example.
> Thanks, John.
Technically you can compute the ML estimate of both parameters of a two-parameter distribution from one data point:
    In [2]: scipy.stats.norm.fit([0])
    Out[2]: (4.2006250261886009e-22, 2.0669568930051829e-21)
    In [7]: scipy.stats.norm.fit([1])
    Out[7]: (1.0, 5.4210108624275222e-20)
But in this case the width estimate of (numerically) 0 is not meaningful: you will get an ML-estimated width of 0 for any true width, because you don't have enough data to estimate the width. You need at least two data points to get real estimates for two parameters:
    In [6]: scipy.stats.norm.fit([0,1])
    Out[6]: (0.5, 0.5)
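The numbers above can be checked against the closed-form ML estimates for a normal distribution: the sample mean, and the square root of the mean squared deviation from it. The sketch below computes them by hand; norm_mle is a hypothetical helper name introduced only for this illustration.

```python
import numpy as np
from scipy import stats

def norm_mle(x):
    """Closed-form ML estimates (mean, width) for a normal distribution."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = np.sqrt(np.mean((x - mu) ** 2))  # divides by n, not n - 1
    return mu, sigma

one_point = norm_mle([1])        # the width is exactly 0 for any single point
two_points = norm_mle([0, 1])    # both estimates equal 0.5
library_fit = stats.norm.fit([0, 1])
print(one_point, two_points, library_fit)
```

With a single observation the deviation from the mean is always zero, so the estimated width is zero whatever the true width was.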

On Fri, Jun 24, 2011 at 9:09 AM, John Reid <j.reid@mail.cryst.bbk.ac.uk> wrote:
> I'm not quite sure what you mean by "identify". It is a ML estimate isn't it? That seems legitimate here but it wasn't really my original question. I was just using [.5] as an example.
Simplest example: fit a linear regression line through one point. There are infinitely many solutions that all fit the point exactly, so we cannot estimate both the constant and the slope; but if we fix one, we can estimate the other parameter.
Or, as in Christoph's example, you just get a mass point, a degenerate solution; in other cases the Hessian will be singular.
Josef
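The regression analogy can be sketched numerically. The single point (2, 3) and the candidate lines are arbitrary values chosen for illustration; np.linalg.lstsq is used only to report the rank of the design matrix.

```python
import numpy as np

x0, y0 = 2.0, 3.0                 # the single observation (arbitrary values)
X = np.array([[1.0, x0]])         # design matrix columns: constant, slope
y = np.array([y0])

# One equation, two unknowns: the design matrix has rank 1.
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# Several different (constant, slope) pairs all pass through the point.
candidates = [(3.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
fitted = [const + slope * x0 for const, slope in candidates]

# Fixing one parameter (here constant = 0) pins down the other.
slope_given_zero_constant = y0 / x0
print(rank, fitted, slope_given_zero_constant)
```

lstsq still returns a coefficient vector, but it is just one of infinitely many exact solutions (the one with the smallest norm), which is the identification problem described above.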

On 24/06/11 14:32, josef.pktd@gmail.com wrote:
> simplest example: fit a linear regression line through one point. There are an infinite number of solutions, that all fit the point exactly. So we cannot estimate constant and slope, but if we fix one, we can estimate the other parameter.
Agreed, although a linear regression is not a beta distribution.
> Or, in Christoph's example below you just get a mass point, degenerate solution, in other cases the Hessian will be singular.
I agree that an ML estimate of a Gaussian's variance makes little sense from one data point. In the case of a beta distribution, the ML estimate is more useful. I would prefer a Bayesian approach with a prior and full posterior, but that could lead to another debate. But anyway, I'm not trying to estimate the parameters from one data point; it was just an example.
John.
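One way to probe the single-point beta case numerically is to evaluate the log-likelihood of the observation 0.5 along the symmetric family a = b, with loc=0 and scale=1 held fixed. Both restrictions are assumptions made only for this sketch, and it says nothing about what fit() itself returns.

```python
import numpy as np
from scipy import stats

x = 0.5                                        # the single observation
shapes = np.array([1.0, 10.0, 100.0, 1000.0])  # candidate values for a = b

# Log-likelihood of Beta(a, a) on (0, 1) for the one data point.
loglik = stats.beta.logpdf(x, shapes, shapes)
for a, ll in zip(shapes, loglik):
    print(f"a = b = {a:7.1f}   log-likelihood = {ll:.3f}")
```

Along this direction the log-likelihood keeps increasing as a = b grows, because the density concentrates ever more tightly around 0.5, so under these assumptions there is no finite maximiser for both shape parameters from one point.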