scipy.stats.distributions: note on initial parameters for fitting the beta distribution
Here is a link on estimating initial parameters for fitting the beta distribution. I'm posting it as it seems quite different from the current calculation in scipy.stats.distributions, and it uses only the sample mean and the biased sample variance. I am not competent to judge which of the two methods is better for scipy, and hope someone on the mailing list can do so.

http://www.xycoon.com/beta_parameterestimation.htm

James
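[Editor's note: the estimator described on the linked page is a plain method-of-moments fit for data in (0, 1). A minimal stdlib-only sketch follows; the formulas are the standard ones, and the function name is mine, not from the page or from scipy.]

```python
from statistics import fmean, pvariance

def beta_mom_start(data):
    """Method-of-moments starting values (alpha, beta) for a beta fit,
    computed from the sample mean and the *biased* sample variance."""
    m = fmean(data)              # sample mean
    v = pvariance(data)          # biased (population) variance
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

a, b = beta_mom_start([0.2, 0.4, 0.6, 0.8])
print(a, b)                      # both near 2.0 for this symmetric sample
```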
On Mon, Oct 25, 2010 at 3:36 PM, James Phillips <zunzun@zunzun.com> wrote:
Here is a link on estimating initial parameters for fitting the beta distribution...
This is exactly the same as the `fit` method in the scipy trunk version when loc and scale are fixed. The problem arises only if you also want to estimate loc and scale; in that case standard MLE doesn't really work. I'm not sure what the estimator in _fitstart is; it looks like matching skew and kurtosis.

As Per mentioned in the other thread, the best candidates for generic estimation will be other methods, like maximum product of spacings (I never remember the official name), or maybe matching moments or similar, which is also one recommendation for extreme value distributions.

Are you handling fixed loc and scale in your code? In that case, it might be possible just to restrict the use of the beta distribution to data between zero and one, or to fix loc = x.min() - eps, scale = (x.max() - x.min() + 2*eps), or something like this, if you don't want another estimation method. (I don't remember whether I tried this for the beta distribution.)

Josef
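[Editor's note: Josef's loc/scale suggestion as a minimal sketch. The eps value here is illustrative, not machine epsilon, and the sample data are made up; with scipy's `fit` the pinned values would be passed as fixed `floc`/`fscale`.]

```python
# Pin loc and scale from the data range so that (x - loc) / scale
# falls strictly inside (0, 1); the two shape parameters can then be
# estimated on their own.
eps = 1e-8                               # illustrative offset, not machine eps
x = [1.3, 2.0, 4.7, 3.1, 2.6]
loc = min(x) - eps
scale = (max(x) - min(x)) + 2 * eps
u = [(xi - loc) / scale for xi in x]
assert 0.0 < min(u) and max(u) < 1.0     # data now strictly in (0, 1)
```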
James

_______________________________________________
SciPy-Dev mailing list
SciPy-Dev@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-dev
Attached is a version of your file "matchdist.py" that is working well in my tests. It tests a half-dozen-ish different starting parameter sets for each continuous distribution - although this is often redundant - and uses the best fit among them based on the ks_pval statistic. It seems to run fast enough for use on my web site if I do not use the distributions ncf or kstwobign, so those are excluded in the file.

I'll parallelize this code, make a few more tweaks, and then add it to my web site. I am concerned about the large number of warnings from scipy, since they will fill up the web server error logs, but my understanding is that those warning messages can be suppressed or at least kept out of stderr.

James
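[Editor's note: the multiple-starting-points idea reduces to a small loop. This is a generic sketch, not the matchdist.py code itself; `fit_once` and `score` are stand-ins for a real fitting routine and a goodness-of-fit statistic where higher is better, like ks_pval.]

```python
def best_fit(data, starts, fit_once, score):
    """Try each starting parameter set, skip failures, and keep the
    best fit according to `score` (higher is better)."""
    best_score, best_params = None, None
    for s in starts:
        try:
            params = fit_once(data, s)   # may raise for a bad start
        except Exception:
            continue                     # a failed start is simply skipped
        q = score(data, params)
        if best_score is None or q > best_score:
            best_score, best_params = q, params
    return best_score, best_params

# Toy demonstration: "fitting" returns the start itself, and the score
# peaks at 3, so the loop should pick 3.
q, p = best_fit(None, [1, 2, 3, 4],
                lambda d, s: s,
                lambda d, params: -(params - 3) ** 2)
print(q, p)   # 0 3
```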
On Mon, Oct 25, 2010 at 3:14 PM, <josef.pktd@gmail.com> wrote:
This is exactly the same as the `fit` method in the scipy trunk...
Here is a more polished and considerably smaller version of your example file matchdist.py that uses either nnlf or residuals for ranking, and includes checks for NaN, +inf, and -inf. I think it now has all of the logic and range checks it needs.

James

2010/10/30 James Phillips <zunzun@zunzun.com>:
I'll parallelize this code and make a few more tweaks, and then add it to my web site.
File attached.

On Sun, Oct 31, 2010 at 8:33 AM, James Phillips <zunzun@zunzun.com> wrote:
Here is a more polished and quite smaller version of your example file matchdist.py that uses either nnlf or residuals for ranking, and includes checks for NaN, +inf and -inf. I think this has all of the logic and range checks that it needs.
On Sun, Oct 31, 2010 at 9:34 AM, James Phillips <zunzun@zunzun.com> wrote:
File attached.
I tried out both your scripts over the weekend but didn't get around to replying. Most distributions work pretty fast, but there are still a few unsuccessful time wasters in there. For example, ksone is also mainly a distribution for a statistical test, and in my run it took a long time without a successful fit.

I'm not sure how nnlf will work as a selection criterion among distributions, and similarly the residual sum of squares might not be a good or robust criterion, for example with heavy-tailed distributions. But that's just a guess; the only (commercial) package that I looked at offered a choice between Kolmogorov-Smirnov, Anderson-Darling, and two chi-square tests (equal-spaced and equal-probability) as the distance or goodness-of-fit measure. (Entropy would be another criterion, but I haven't seen it used yet for selecting the distribution.)

Your version will also help narrow down what might be good starting values; eventually I would prefer to hardcode (optional) distribution-specific _start_values.

Your second script (after dropping the failures) has 10 distributions fewer than the first (70 instead of 80), so there is still some distribution-specific work left.

Josef
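[Editor's note: of the measures Josef lists, the Kolmogorov-Smirnov distance is the easiest to sketch. This is a generic stdlib-only illustration; NormalDist and the sample data stand in for any fitted candidate distribution with a cdf.]

```python
from statistics import NormalDist, fmean, pstdev

def ks_distance(data, cdf):
    """Max distance between the empirical CDF of `data` and `cdf`,
    checking both sides of each empirical-CDF jump."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        fx = cdf(x)
        # empirical CDF steps from i/n to (i+1)/n at each sample point
        d = max(d, (i + 1) / n - fx, fx - i / n)
    return d

data = [4.9, 5.1, 5.0, 4.8, 5.2, 5.05, 4.95]
fitted = NormalDist(fmean(data), pstdev(data))
dist = ks_distance(data, fitted.cdf)
print(dist)   # small for a well-fitting candidate
```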
Thank you kindly for trying the scripts.

James

On Mon, Nov 1, 2010 at 10:13 PM, <josef.pktd@gmail.com> wrote:
I tried out both your scripts during the weekend but didn't get around to replying...
I am finding that using eps alone is insufficient, as the magnitude of a given floating-point number may be too large for adding or subtracting eps to change it. Below is example code that illustrates my meaning on my 32-bit Ubuntu Linux installation.

James

import numpy

eps = numpy.finfo(float).eps
print('eps =', eps)

a = 500.0
b = 500.0
print('should be zero: a-b =', a - b)

c = b + eps
print('should be -eps, not zero: a-c =', a - c)  # actually prints 0.0: adding eps to 500.0 is a no-op

d = 1.0E-290
e = 1.0E-290
print('should be zero: d-e =', d - e)

f = e + eps
print('should be -eps, not zero: d-f =', d - f)  # here the difference really is about -eps

On Mon, Oct 25, 2010 at 10:14 AM, <josef.pktd@gmail.com> wrote:
Are you handling fixed loc and scale in your code? In that case, it might be possible just to restrict the usage of the beta distribution for data between zero and one, or fix loc=x.min()-eps, scale=(x.max() - x.min() + 2*eps) or something like this, if you don't want to get another estimation method. (I don't remember if I tried this for the beta distribution.)
On Mon, Nov 8, 2010 at 11:43 AM, James Phillips <zunzun@zunzun.com> wrote:
I am finding that using eps alone is insufficient, as the size of a given floating point number may be too large to add or subtract eps and have the floating point number change. Below is example code that on my 32-bit Ubuntu Linux distribution illustrates my meaning.
For me, eps in statistics, with accumulated errors and approximations, is often only 1e-4, 1e-6, 1e-8, or 1e-10 depending on the problem, not machine epsilon. For example, if the calculations are based on a previous optimization or fsolve or on numerical derivatives, then the precision is already limited by the convergence criteria. I usually try out how small I can make eps without getting into problems. I don't know how it will work specifically for the beta case.

Josef
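[Editor's note: both points - machine epsilon being the spacing at 1.0, and a usable eps being value- and problem-dependent - fit in a few lines. math.ulp needs Python 3.9+, and rel_eps is a made-up helper name, not an API from this thread.]

```python
import math

eps = 2.0 ** -52                 # machine epsilon: float spacing near 1.0
assert 500.0 + eps == 500.0      # it vanishes at this magnitude (James's point)
assert 500.0 + math.ulp(500.0) > 500.0   # one ulp *at* 500.0 does not

def rel_eps(x, factor=1e-8):
    """A value-relative offset, the kind of problem-dependent eps
    Josef describes, instead of a fixed machine epsilon."""
    return max(abs(x), 1.0) * factor

x_min = 500.0
loc = x_min - rel_eps(x_min)     # 500.0 - 5e-6, safely below x_min
assert loc < x_min
```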
participants (2)

- James Phillips
- josef.pktd@gmail.com