Warning about remaining issues in stats.distributions?
There are still some known failures or methods that don't work very well for some distributions:

* skew, kurtosis: a few remaining incorrect results, no exact test available
* entropy: returns nan in some cases, correctness of values not tested
* fit: does not converge, or does not estimate correctly for some distributions; this is expected, since maximum likelihood does not always work
* distributions that have problems for some range of parameters
* incorrect random number generator for one discrete distribution

Is it useful to put a warning about these issues in stats.distributions in the release?

A friend of mine is doing value-at-risk calculations for operational risk analysis, but I am still reluctant to recommend scipy.stats when it is not clear whether the results are correct, although scripting in Python would be much easier than struggling with some commercial package with a GUI. The problem is that in this application they use many distributions automatically, not just a few that are well tested by frequent use.

Per Brodtkorb has an improved version of distributions.py with an additional estimation procedure, which we can discuss as an enhancement after 0.7 is out of the way, if there is interest in this.

Josef
On Mon, Dec 1, 2008 at 8:06 AM, <josef.pktd@gmail.com> wrote:
There are still some known failures or methods that don't work very well for some distributions:
* skew, kurtosis: a few remaining incorrect results, no exact test available
* entropy: returns nan in some cases, correctness of values not tested
* fit: does not converge, or does not estimate correctly for some distributions; this is expected, since maximum likelihood does not always work
* distributions that have problems for some range of parameters
* incorrect random number generator for one discrete distribution
Is it useful to put a warning about these issues in stats.distributions in the release?
It would be great to add something like this to the docstrings (as long as you weren't proposing to raise a warning).
Per Brodtkorb has an improved version of distributions.py with an additional estimation procedure, which we can discuss as an enhancement after 0.7 is out of the way, if there is an interest in this.
That would be great. Once I create the 0.7.x branch, the trunk will be open for 0.8 development. Hopefully we can get this in ASAP in the 0.8 development.

Thanks,

--
Jarrod Millman
Computational Infrastructure for Research Labs
10 Giannini Hall, UC Berkeley
phone: 510.643.4014
http://cirl.berkeley.edu/
* distributions that have problems for some range of parameters

so a good (imho) piece to add to unittests for the 'issues' to be fixed:
scipy.stats.rdist(1.32, 0, 1).cdf(-1.0+numpy.finfo(float).eps)

(tried on the SVN trunk to verify that it fails... discover the reason on your own ;-))

For myself I resolved it with

__eps = N.sqrt(N.finfo(float).eps)
rdist = rdist_gen(a=-1.0+__eps, b=1.0-__eps, ....

but I am not sure if that is the cleanest way... and maybe some other distributions would need such tweakery to make them more stable.

--
Keep in touch  (yoh@|www.)onerussian.com
Yaroslav Halchenko  ICQ#: 60653192
Linux User [175555]
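The failure mode above can be reproduced without scipy at all: the rdist pdf is proportional to (1 - x*x)**(c/2 - 1), and for c < 2 the exponent is negative, so Python's built-in pow raises ZeroDivisionError as soon as an integration node lands exactly on +/-1. A minimal sketch (the helper names are mine, for illustration, not scipy's):

```python
# Minimal reproduction of the boundary failure, independent of scipy.
# The rdist pdf is proportional to (1 - x*x)**(c/2 - 1); for c < 2 the
# exponent is negative, so the formula divides by zero at x = +/-1.

def rdist_pdf_kernel(x, c):
    """Unnormalized rdist density, written with Python's built-in pow,
    as the scipy implementation of the time effectively did."""
    return pow(1.0 - x * x, c / 2.0 - 1.0)

def fails_at_boundary(c=1.32):
    """True if evaluating the kernel exactly on the boundary raises."""
    try:
        rdist_pdf_kernel(-1.0, c)  # an integration node rounded onto the boundary
        return False
    except ZeroDivisionError:
        return True
```

Shrinking the support by sqrt(eps), as in the workaround above, simply keeps every node strictly inside (-1, 1) so the kernel stays finite.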
On Tue, Dec 9, 2008 at 1:32 PM, Yaroslav Halchenko <lists@onerussian.com> wrote:
* distributions that have problems for some range of parameters

so a good (imho) piece to add to unittests for the 'issues' to be fixed:
scipy.stats.rdist(1.32, 0, 1).cdf(-1.0+numpy.finfo(float).eps)
I know that there are several problems where the calculations are not very precise at the boundaries. But I didn't know about this case, and I hope to get rid of the exceptions, so please file tickets for any of these cases. In my testing so far, I was quite happy if I got results like these:

stats.rdist(1.32, 0, 1).cdf(-1.0+1e-13)
1.2060822041379395e-009

But I guess there might still be a systematic problem with the treatment of open parameter spaces or open domains, c>0, x>0, which I have not tried out at all. Per Brodtkorb has proposed some improvements in numerical precision, which we will commit after 0.7 is out.

For some numerical methods, e.g. ppf, it won't be possible to solve arbitrarily close to 0 or 1 for every distribution, given the way the generic method finds the inverse function; in the test suite, I think, I used 0.001 and 0.999.

In my fuzz testing there were some cases where, for example, the shape parameter has the restriction c>0 but seems to work only for c>2. I postponed these bugs for now, since I have to go over each distribution individually to figure out what the problem is. Additionally, for many distributions I don't know if anyone would ever need the distribution for that parameter range.

Back to the rdist example: in my bugfixes, I temporarily removed rdist._cdf, since the generic method also works for large parameters, while special.hyp2f1(0.5,1.0-c/2.0,1.5,x*x) does not work for large c. Once I know over which domain the special functions are reliable, dispatching to the generic methods only for some part of the parameter space will be an improvement, but this requires some time-consuming testing.

So any help, reports, and patches are very welcome, especially from users who actually use the specific distribution.

Josef
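The generic-ppf limitation mentioned here (inversion is only reliable for moderate quantiles, hence 0.001 and 0.999 in the test suite) can be sketched with a toy bisection inverter; the bracket, tolerance, and logistic example are illustrative assumptions of mine, not scipy's actual implementation:

```python
import math

def logistic_cdf(x):
    """Logistic cdf, used as a stand-in analytic distribution."""
    return 1.0 / (1.0 + math.exp(-x))

def generic_ppf(cdf, q, lo=-50.0, hi=50.0, tol=1e-12):
    """Invert a cdf by bisection over a fixed bracket: a toy version of a
    generic ppf. Accurate for moderate q, but unreliable in the extreme
    tails once the true quantile falls outside the bracket."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < q:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

For q = 0.999 this recovers log(999) to high accuracy, but for q = 1e-30 the true logistic quantile (about -69) lies where the cdf is numerically indistinguishable from tiny values inside the bracket, and the bisection just collapses onto the bracket edge.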
In my bugfixes, I temporarily removed the rdist._cdf, since the generic method works also for large parameters, while special.hyp2f1(0.5,1.0-c/2.0,1.5,x*x) does not work for large c. Once
I can only confirm that, and that is why I had to manually redefine rdist_gen in the same fashion to be used with the older scipy 0.5.2-0.1 (with def cdf removed, and also rdist.veccdf.nin = 2); otherwise I was getting arbitrary numbers in the output of cdf for some large values of c (10000) and various values of x in the tails.
I know over which domain the special functions are reliable, dispatching to the generic methods only for some part of the parameter space will be an improvement, but this requires some time consuming testing.
Isn't CPU time relatively cheap these days? Creating a simple test battery which would sample the space of arguments, with emphasis on the boundaries, might prove informative for hunting down buried problems. Alternatively, indeed, it is possible just to wait for someone to hit the issue on their data / favorite distribution ;-)
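The proposed test battery could look roughly like the sketch below (the names and probe decades are mine; a real version would loop over scipy's distributions and call their .cdf methods): probe a cdf at points approaching each support boundary and flag non-finite, out-of-range, or non-monotone values. The uniform cdf on [-1, 1] in the test is just a stand-in with a known exact answer.

```python
def probe_boundaries(cdf, a, b, decades=range(6, 16)):
    """Sample a cdf at points approaching both support boundaries and
    collect any out-of-range or non-monotone values. A toy version of
    the 'test battery' idea for hunting boundary problems."""
    problems = []
    left = [a + 10.0 ** -k for k in decades]   # approach the lower boundary
    right = [b - 10.0 ** -k for k in decades]  # approach the upper boundary
    for pts, toward_upper in ((left, False), (right, True)):
        vals = [cdf(x) for x in pts]
        for x, v in zip(pts, vals):
            if not (0.0 <= v <= 1.0):
                problems.append((x, v))
        # moving toward the lower edge the cdf must not increase, and vice versa
        if toward_upper:
            ok = all(u <= w for u, w in zip(vals, vals[1:]))
        else:
            ok = all(u >= w for u, w in zip(vals, vals[1:]))
        if not ok:
            problems.append(("non-monotone near boundary", pts[0]))
    return problems
```

A correct cdf like the uniform (x+1)/2 on [-1, 1] passes cleanly, while a cdf with a small negative bias near the boundary gets flagged.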
heh heh... very sad to see that the warning was simply ignored and 0.7.0 still has this issue on exactly the same command (hence the advice to include it in the unittests was ignored as well):
print scipy.__version__
0.7.0
scipy.stats.rdist(1.32, 0, 1).cdf(-1.0+numpy.finfo(float).eps)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/scipy/stats/distributions.py", line 117, in cdf
    return self.dist.cdf(x,*self.args,**self.kwds)
  File "/usr/lib/python2.5/site-packages/scipy/stats/distributions.py", line 625, in cdf
    place(output,cond,self._cdf(*goodargs))
  File "/usr/lib/python2.5/site-packages/scipy/stats/distributions.py", line 528, in _cdf
    return self.veccdf(x,*args)
  File "/usr/lib/python2.5/site-packages/numpy/lib/function_base.py", line 1886, in __call__
    _res = array(self.ufunc(*newargs),copy=False,
  File "/usr/lib/python2.5/site-packages/scipy/stats/distributions.py", line 525, in _cdf_single_call
    return scipy.integrate.quad(self._pdf, self.a, x, args=args)[0]
  File "/usr/lib/python2.5/site-packages/scipy/integrate/quadpack.py", line 185, in quad
    retval = _quad(func,a,b,args,full_output,epsabs,epsrel,limit,points)
  File "/usr/lib/python2.5/site-packages/scipy/integrate/quadpack.py", line 249, in _quad
    return _quadpack._qagse(func,a,b,args,full_output,epsabs,epsrel,limit)
  File "/usr/lib/python2.5/site-packages/scipy/stats/distributions.py", line 3046, in _pdf
    return pow((1.0-x*x),c/2.0-1) / special.beta(0.5,c/2.0)
ZeroDivisionError: 0.0 cannot be raised to a negative power
and my workaround doesn't work any more, so I need to look for another one.
On Thu, Mar 12, 2009 at 10:41 PM, Yaroslav Halchenko <lists@onerussian.com> wrote:
heh heh... very sad to see that the warning was simply ignored and 0.7.0 still has this issue on exactly the same command (hence the advice to include it in the unittests was ignored as well).
and my workaround doesn't work any more so I need to look for another one.
Fixing numerical integration over the distance of a machine epsilon, for a function that has a singularity at the boundary, was not very high on my priority list. If there is a real use case that requires this, I can do a temporary fix. As far as I have seen, you explicitly use this special case as a test case, not a test that reflects a failing use case. Overall, I prefer a general solution to the boundary problem for numerical integration, instead of messing around with the theoretically correct boundaries.

Also, I would like to know what the references for the rdist are. A Google search for "r distribution" is pretty useless, and I have not yet found a reference or an explanation of the rdist and its uses.

Josef
On Thu, Mar 12, 2009 at 11:51 PM, <josef.pktd@gmail.com> wrote:
On Thu, Mar 12, 2009 at 10:41 PM, Yaroslav Halchenko <lists@onerussian.com> wrote:
heh heh... very sad to see that the warning was simply ignored and 0.7.0 still has this issue on exactly the same command (hence the advice to include it in the unittests was ignored as well).
and my workaround doesn't work any more so I need to look for another one.
Fixing numerical integration over the distance of a machine epsilon of a function that has a singularity at the boundary was not very high on my priority list.
If there is a real use case that requires this, I can do a temporary fix. As far as I have seen, you use explicitly this special case as a test case and not a test that would reflect a failing use case. Overall, I prefer to have a general solution to the boundary problem for numerical integration, instead of messing around with the theoretically correct boundaries.
Also, I would like to know what the references for the rdist are. Google search for r distribution is pretty useless, and I have not yet found a reference or an explanation of the rdist and its uses.
Josef
There is numerical imprecision for calculations close to the boundary, with

stats.rdist.a
-1

rdist is symmetric, so the following two should be the same:

stats.rdist(1.32, 0, 1).cdf(-1.0+1e-12)
5.5125207421811825e-009
1-stats.rdist(1.32, 0, 1).cdf(1.0-1e-12)
1.8800738743607326e-011
If you want to avoid the exception for values very close to the boundary, then you can override the boundary yourself.
stats.rdist.a = -1 + sqrt(np.finfo(float).eps)
sqrt(np.finfo(float).eps)
1.4901161193847656e-008
but then you lose numerical precision at the other points close to the boundary. The zero values are obtained because the integral is negative: the upper integration bound is smaller than the lower integration bound, and, I guess, the check for validity of values in cdf sets out-of-bound (negative) values to zero.
stats.rdist(1.32, 0, 1).cdf(-1.0+np.finfo(float).eps)
0.0
stats.rdist(1.32, 0, 1).cdf(-1.0+1e-14)
0.0
stats.rdist(1.32, 0, 1).cdf(-1.0+1e-12)
0.0
stats.rdist(1.32, 0, 1).cdf(-1.0+1e-11)
0.0
stats.rdist(1.32, 0, 1).cdf(-1.0+1e-10)
0.0
stats.rdist(1.32, 0, 1).cdf(-1.0+1e-9)
0.0
stats.rdist(1.32, 0, 1).cdf(-1.0+1e-8)
0.0
stats.rdist(1.32, 0, 1).cdf(-1.0+1e-7)
7.8873658135486906e-006
compared with symmetric case
stats.rdist(1.32, 0, 1).sf(1.0-1e-7)
1.8275658764110858e-010
I don't know about the internals of scipy.integrate.quad and how it handles boundary points. For anything closer to the boundary than 1e-10, the estimated absolute error is larger than the integral. Anything closer to the boundary than 1e-14 raises an exception.
scipy.integrate.quad(stats.rdist._pdf, stats.rdist.a, -1+1e-8, args=(1.32,))
(2.4122318926575886e-006, 1.984389088393137e-012)
scipy.integrate.quad(stats.rdist._pdf, stats.rdist.a, -1+1e-9, args=(1.32,))
(5.2774056254715333e-007, 5.6048532675329467e-012)
scipy.integrate.quad(stats.rdist._pdf, stats.rdist.a, -1+1e-10, args=(1.32,))
(1.152923988934946e-007, 1.3805352057819491e-008)
scipy.integrate.quad(stats.rdist._pdf, stats.rdist.a, -1+1e-11, args=(1.32,))
(2.5201736781677286e-008, 4.7714144348752183e-009)
this looks ok:
stats.rdist.a = -1 + np.finfo(float).eps
scipy.integrate.quad(stats.rdist._pdf, stats.rdist.a, -1+1e-7, args=(1.32,))
(1.1026024914000647e-005, 2.4375839668006874e-012)
scipy.integrate.quad(stats.rdist._pdf, stats.rdist.a, -1+1e-14, args=(1.32,))
(2.4315077563114524e-010, 3.8495287806628835e-012)
stats.rdist(1.32, 0, 1).cdf(-1.0+np.finfo(float).eps)
0.0
stats.rdist(1.32, 0, 1).cdf(-1.0+2*np.finfo(float).eps)
Warning: Extremely bad integrand behavior occurs at some points of the integration interval.
1.2660374321273558e-011
stats.rdist(1.32, 0, 1).cdf(-1.0+3*np.finfo(float).eps)
Warning: Extremely bad integrand behavior occurs at some points of the integration interval.
2.2899449003632814e-011
stats.rdist(1.32, 0, 1).cdf(-1.0+1e-13)
1.1878387051893005e-009
stats.rdist(1.32, 0, 1).cdf(-1.0+1e-10)
1.1528892986424152e-007
So, your solution of increasing the lower bound removes the exception but changes (screws up) the next range of values. Introducing integration bounds different from dist.a and dist.b might work, but it requires testing to see whether it works for all distributions, and we don't have tests for all epsilon boundary cases.

Josef
Fixing numerical integration over the distance of a machine epsilon of a function that has a singularity at the boundary was not very high on my priority list.

fair enough, but still sad ;) it is just that I got frustrated since the upgrade to 0.7.0 caused quite a few places to break, and this one was one of the 'sore' ones ;)
If there is a real use case that requires this, I can do a temporary

well... I think I've mentioned before how I ran into this issue: our unittests of PyMVPA [1] fail, primarily (I think) due to my silly distribution-matching function.
fix. As far as I have seen, you use explicitly this special case as a test case and not a test that would reflect a failing use case.

well -- this test case is just an example. that distribution matching is the one which causes it in the unittests
Overall, I prefer to have a general solution to the boundary problem for numerical integration, instead of messing around with the theoretically correct boundaries.

sure! a proper solution would be nice. as for "messing with boundaries": imho it depends on how they are messed with ;) Maybe the original self.{a,b} could be left alone, but for numeric integration some others could be introduced (self.{a,b}_eps), which are used for integration, with some correction term added whenever we are within the 'theoretical' boundaries, to compensate for the "missing" part of [self.a, self.a_eps].
Most of the distributions would not need to have them different from a,b and would have 0 correction. Testing is imho quite obvious -- just go through all distributions and try to obtain .cdf values at sample points in the vicinity of the boundaries. I know that this is not exhaustive if a distribution has singularities within, but well -- it is better than nothing imho ;) I bet you can come up with a better solution.
Also, I would like to know what the references for the rdist are.

actually I've found empirically that rdist is the one which I needed, and indeed there is not much information on the web:
Google search for r distribution is pretty useless, and I have not yet found a reference or an explanation of the rdist and its uses.
rdist corresponds to the distribution of a (single) coordinate component for a point located on a hypersphere (in a space of N dimensions) of radius 1. When N is large it is well approximated by a Gaussian, but in low dimensions it is very different and quite interesting (e.g. flat for N=3).

n.b. actually my boss told me that there is a family of distributions this one belongs to, but I've forgotten which one ;) will ask today again

there was just a single page which I ran into which described rdist and plotted sample pdfs, but I can't find it now

[1] http://www.pymvpa.org/
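This geometric description can be checked by simulation without scipy: normalizing standard-normal vectors gives points distributed uniformly on the unit sphere (by rotation invariance), the first coordinate then has variance 1/N, and for N = 3 it is exactly uniform on [-1, 1] (Archimedes' hat-box theorem). Matching densities suggests the coordinate in N dimensions corresponds to rdist with shape c = N - 1; that identification is my reading of the pdf (1-x^2)^(c/2-1), not something stated in the thread. A sketch, assuming a reasonably modern numpy:

```python
import numpy as np

def sphere_coordinate_samples(n_dim, size, seed=0):
    """First coordinate of points drawn uniformly on the unit sphere in
    n_dim dimensions: normalize standard-normal vectors, whose rotation
    invariance makes the normalized points uniform on the sphere."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((size, n_dim))
    return v[:, 0] / np.linalg.norm(v, axis=1)
```

With 200000 samples the N = 10 coordinate variance comes out near 1/10, and the N = 3 coordinate behaves like a uniform draw from [-1, 1], matching the "flat in N=3" observation.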
On Fri, Mar 13, 2009 at 10:10 AM, Yaroslav Halchenko <lists@onerussian.com> wrote:
Fixing numerical integration over the distance of a machine epsilon of a function that has a singularity at the boundary was not very high on my priority list.

fair enough, but still sad ;) it is just that I got frustrated since the upgrade to 0.7.0 caused quite a few places to break, and this one was one of the 'sore' ones ;)
I ran the pymvpa test suite several times with up-to-date versions of scipy and I usually didn't see any problems. I expected that: all my changes in stats so far should be backwards compatible, at least for the parts that worked correctly before. If there are any problems, please be more specific and report them on the mailing list or open a ticket.
If there is a real use case that requires this, I can do a temporary

well... I think I've mentioned before how I ran into this issue: our unittests of PyMVPA [1] fail. Primarily (I think) to my silly distribution matching function.
fix. As far as I have seen, you use explicitly this special case as a test case and not a test that would reflect a failing use case.

well -- this test case is just an example. that distribution matching is the one which causes it in unittests
I only found the test for rdist in test_stats.testRDistStability, and there it explicitly checks this corner case.

In general, the current implementation of the fit method still has several major problems before it works out of the box for all distributions, especially those with a finite support boundary. Generic starting values don't work with many distributions, and fitting the location for distributions with finite support bounds using maximum likelihood often has problems. In pymvpa you also allow fitting of semifrozen distributions; is this your default in the use of MatchDistributions for distributions with bounded support, such as rdist, or half-open support? In your match-distribution example I get many distributions with a very bad kstest statistic, and when I tried out possible changes to automatically match distributions with fit, this often indicated estimation problems and not necessarily a bad fit. But I did this for only a few sample distributions.

For the largest part, I like the statistical analysis in pymvpa, and you are far ahead of scipy.stats, e.g. MatchDistribution and rv_semifrozen, or many of the other statistical methods. But in many cases I don't like the actual implementation so much, at least for general statistical use.

If you have an example where matching the rdist distribution actually raises your exception, then we could look at the fitting method and see whether we can make it more robust. I still don't see how any statistical method can hit boundary+eps. For the rdist solution see below.
Overall, I prefer to have a general solution to the boundary problem for numerical integration, instead of messing around with the theoretically correct boundaries. sure! proper solution would be nice. as for "messing with boundaries": imho it depends on how 'they are messed up with" ;) May be original self.{a,b} could be left alone, but for numeric integration some others could be introduced (self.{a,b}_eps), which are used for integration and some correction term to be added whenever we are in the 'theoretical' boundaries, to compensate for "missing" part of [self.a, self.a_eps]
Most of the distributions would not need to have them different from a,b and have 0 correction. Testing is imho quite obvious -- just go through all distributions and try to obtain .cdf values at sample points in the vicinity of the boundaries. I know that this is not exhaustive if a distribution has singularities within, but well -- it is better than nothing imho ;)
I bet you can come up with a better solution.
Also, I would like to know what the references for the rdist are. actually I've found empirically that rdist is the one which I needed, and indeed there is not much information on the web:
rdist corresponds to the distribution of a (single) coordinate component for a point located on a hypersphere (in space of N dimensions) of radius 1. When N is large it is well approximated by Gaussian, but in the low dimensions it is very different and quite interesting (e.g. flat in N=3)
n.b. actually my boss told me that there is a family of distributions where this one belongs to but I've forgotten which one ;) will ask today again
Google search for r distribution is pretty useless, and I have not yet found a reference or an explanation of the rdist and its uses. there was just a single page which I ran to which described rdist and plotted sample pdfs. but can't find it now
I read somewhere, though I don't remember where, that rdist is the distribution of the correlation coefficient, but without more information that's pretty useless.
[1] http://www.pymvpa.org/
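For what it's worth, the correlation connection can be made concrete: under the null hypothesis of zero correlation in a bivariate normal, the sample correlation r from n observations has density proportional to (1-r^2)^((n-4)/2), i.e. rdist with c = n - 2, which implies E[r^2] = 1/(n-1). That identification is my inference from the pdf, not something stated in the thread; a Monte Carlo sketch:

```python
import numpy as np

def null_correlation_samples(n_obs, reps, seed=0):
    """Pearson r between two independent N(0,1) samples of length n_obs,
    repeated reps times. Under this null, r should follow rdist with
    shape c = n_obs - 2 (assumed mapping), so E[r^2] = 1/(n_obs - 1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((reps, n_obs))
    y = rng.standard_normal((reps, n_obs))
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    num = (xc * yc).sum(axis=1)
    den = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1))
    return num / den
```

With n_obs = 10 the mean of r^2 over many replications lands near 1/9, consistent with the c = n - 2 reading.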
The solution to the rdist problem is trivial:
np.power(1-1.,-2)
inf
pow(1-1.,-2)
Traceback (most recent call last):
  File "<pyshell#163>", line 1, in <module>
    pow(1-1.,-2)
ZeroDivisionError: 0.0 cannot be raised to a negative power
I hate missing namespaces; I didn't know that pow is the Python builtin and not numpy.power. numpy.power can raise 0.0 to a negative power, so I just have to remove the usage of Python's pow. Much better than fiddling with boundaries. I still have to test it, but this looks ok.

I was wondering why you used the square root in your eps increase on the boundary, because in all my trying in the Python shell it seemed to work also without. But the problem is that I was using a numpy float instead of a Python float, and so numpy.power was used instead of Python's pow (I assume):
(-1+(1-(1e-14)**2))**(-2)
Traceback (most recent call last):
  File "<pyshell#170>", line 1, in <module>
    (-1+(1-(1e-14)**2))**(-2)
ZeroDivisionError: 0.0 cannot be raised to a negative power
(-1+(1-np.finfo(float).eps**2))**(-1)
inf
(-1+(1-(1e-14)**2)).__class__
<type 'float'>
(-1+(1-np.finfo(float).eps**2)).__class__
<type 'numpy.float64'>
Josef
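The pow-versus-numpy.power contrast at the heart of this fix can be captured in a small self-contained check (the helper is illustrative, not scipy code). Note the behavior described here is for float arguments; the integer-dtype case behaves differently across numpy versions.

```python
import numpy as np

def zero_to_negative_power():
    """Contrast Python's builtin pow with numpy.power on 0.0 ** negative:
    the builtin raises ZeroDivisionError, while numpy returns inf (with a
    floating-point warning, suppressed here)."""
    with np.errstate(all="ignore"):
        np_result = np.power(0.0, -2.0)
    try:
        pow(0.0, -2.0)
        builtin_raised = False
    except ZeroDivisionError:
        builtin_raised = True
    return np_result, builtin_raised
```

This is also why the square-root eps trick seemed unnecessary in interactive experiments: an expression built from numpy.float64 values dispatches to numpy's power and yields inf, while the same expression on Python floats goes through the builtin and raises.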
The solution to the rdist problem is trivial:
np.power(1-1.,-2)
inf
pow(1-1.,-2)
Traceback (most recent call last):
  File "<pyshell#163>", line 1, in <module>
    pow(1-1.,-2)
ZeroDivisionError: 0.0 cannot be raised to a negative power
It only partially helped: it doesn't raise an exception anymore, but the inf is incorrect.
stats.rdist(1.32, 0, 1).cdf(-1.0+np.finfo(float).eps)
Warning: Extremely bad integrand behavior occurs at some points of the integration interval.
Warning: Extremely bad integrand behavior occurs at some points of the integration interval.
inf
stats.rdist(1.32, 0, 1).cdf(-1.0+1e-14)
Warning: Extremely bad integrand behavior occurs at some points of the integration interval.
inf
further away from the boundary it looks good
stats.rdist(1.32, 0, 1).cdf(-1.0+1e-12)
5.5125207421811825e-009
stats.rdist._cdf_skip(-1.0+1e-12, 1.32)
5.5260225839681709e-009
stats.rdist._cdf_skip(-1.0+1e-13, 1.32)
1.2092278844910709e-009
stats.rdist(1.32, 0, 1).cdf(-1.0+1e-13)
1.2060822041379395e-009
So back to fiddling with the boundary? Just do
stats.rdist.a = -1.0+np.finfo(float).eps

before calling rdist; this works in some basic tests.
Josef
Thanks Josef once again. I have tested s/pow/N.power/ in rdist and it seems to work great! thanks a lot. numpy is more fun than I thought ;)

In [12]: np.power(0, -1)
Out[12]: -9223372036854775808
In [14]: np.power(0.0, -1)
Out[14]: inf
In [16]: np.power(0, -1.0)
Out[16]: inf

I know that in my case I will not (or should not) get ints as input, but I thought it might be worth mentioning... maybe it is worth a bug report in numpy?
Google search for r distribution is pretty useless, and I have not yet found a reference or an explanation of the rdist and its uses.

there was just a single page which I ran into which described rdist and plotted sample pdfs. but can't find it now

I read somewhere, I don't remember where, that rdist is the distribution of the correlation coefficient, but without more information that's pretty useless

doh! sure it is related... hence the name rdist, since Pearson's correlation coefficient is abbreviated as 'r' ;)
http://en.wikipedia.org/wiki/Correlation_coefficient says that "The distribution of the correlation coefficient has been examined by R. A. Fisher[2][3] and A. K. Gayen.[4]" But those are 100- and 50-year-old publications... not sure if we have them online to check whether they were the ones who derived the analytic form. And it seems related to the 'multidimensional' correlation mentioned in Wikipedia, but it is not clear how the sample size "fits into the equation"... c seems to relate to the dimensionality of the data...

Is it possible to trace back who introduced this lovely piece into scipy? ;) maybe we could ask the author? ;)
I was wondering why you used the square root in your eps increase on the boundary, because in all my trying in the python shell it seems to work also without. But the problem is that I was using a numpy float instead of a python float and so numpy.power was used instead of python pow, (I assume):
doh... just now realized why some pieces didn't come together... whenever I wrote you a reply (while reading your email) I deleted everything down to the signature, which commonly starts with '^^-$'... hence I deleted down to those lines and never saw your power finding ;) cool!!! so if it works all-around (or whenever you are sick of testing it) -- please let me know -- I will adopt this solution in my monkey-patch within pymvpa ;) THANKS!

P.S. Told ya you would come up with a nicer one ;) probably ;)
On Thu, Mar 12, 2009 at 23:51, <josef.pktd@gmail.com> wrote:
Fixing numerical integration over the distance of a machine epsilon of a function that has a singularity at the boundary was not very high on my priority list.
I've been working with some code for integration using the double-exponential transform, which is useful for exactly this case. I'm not familiar with all the issues involved here, but if you would like to explore that option, let me know and I'll send you the code. I was planning on cleaning it up a bit and contributing it to scipy, anyway.
On Sat, Mar 14, 2009 at 8:56 PM, Thouis (Ray) Jones <thouis@broad.mit.edu> wrote:
On Thu, Mar 12, 2009 at 23:51, <josef.pktd@gmail.com> wrote:
Fixing numerical integration over the distance of a machine epsilon of a function that has a singularity at the boundary was not very high on my priority list.
I've been working with some code for integration using the double-exponential transform, which is useful for exactly this case. I'm not familiar with all the issues involved here, but if you would like to explore that option, let me know and I'll send you the code.
I was planning on cleaning it up a bit and contributing it to scipy, anyway.
Thank you for the offer; it would be good to have some more tools available to handle these special cases. There are several distributions that have a point, in the interior or on the boundary of the support, where the density function goes to infinity. In general scipy.integrate.quad seems to handle them quite well except for corner cases, but I don't know what the precision loss in these cases is.

For the specific case of the r-distribution, rdist, the main source of the problem is that I had to disable the explicit formula for the cumulative distribution function, because scipy.special.hyp2f1 produces incorrect numbers over part of the parameter range used in rdist, and I had to fall back to numerical integration.

But for me there is no hurry for these corner-case problems, since, in my available time, I still have other problems and improvements in stats to chase (and for me it's not so much fun to figure out what epsilon multiplied by something going to infinity is, especially if it's not clear what the use case for it is).

Josef
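The double-exponential (tanh-sinh) rule Ray offers is well suited to exactly this situation; for instance, the rdist normalization integral, the integral of (1-x^2)^(c/2-1) over (-1, 1), equals B(1/2, c/2) but blows up at both endpoints for c < 2. A minimal sketch of the transform (step size, truncation, and names are my illustrative choices, not the offered code):

```python
import math

def tanh_sinh_quad(f, h=0.05, max_level=200):
    """Double-exponential quadrature on (-1, 1): substitute
    x = tanh((pi/2) sinh(t)). The transform pushes endpoint singularities
    infinitely far away, so integrable singularities at +/-1 are handled
    gracefully by a plain trapezoid sum in t."""
    total = 0.0
    for k in range(-max_level, max_level + 1):
        t = k * h
        u = 0.5 * math.pi * math.sinh(t)
        x = math.tanh(u)
        if 1.0 - x * x <= 0.0:
            continue  # node rounded onto the boundary: its weight is negligible
        w = h * 0.5 * math.pi * math.cosh(t) / math.cosh(u) ** 2
        total += w * f(x)
    return total

def rdist_norm(c):
    """Normalization constant of rdist: the integral of (1-x^2)^(c/2-1)
    over (-1, 1), which equals the beta function B(1/2, c/2)."""
    return tanh_sinh_quad(lambda x: (1.0 - x * x) ** (c / 2.0 - 1.0))
```

For c = 1.32, the value discussed throughout the thread, this matches B(1/2, 0.66) computed from gamma functions to many digits, despite the singular endpoints that gave quad trouble.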
participants (4)
- Jarrod Millman
- josef.pktd@gmail.com
- Thouis (Ray) Jones
- Yaroslav Halchenko