[SciPy-dev] Sad sad sad... Was: Warning about remaining issues in stats.distributions ?

Fri Mar 13 01:11:05 EDT 2009

On Thu, Mar 12, 2009 at 11:51 PM,  <josef.pktd at gmail.com> wrote:
> On Thu, Mar 12, 2009 at 10:41 PM, Yaroslav Halchenko
> <lists at onerussian.com> wrote:
>> heh heh... very sad to see that the warning was simply ignored and 0.7.0
>> still has this issue on exactly the same command (hence advice to include
>> it to unittests was ignored as well):
>>
>>>>> print scipy.__version__
>> 0.7.0
>>>>> scipy.stats.rdist(1.32, 0, 1).cdf(-1.0+numpy.finfo(float).eps)
>> Traceback (most recent call last):
>>  File "<stdin>", line 1, in <module>
>>  File "/usr/lib/python2.5/site-packages/scipy/stats/distributions.py", line 117, in cdf
>>    return self.dist.cdf(x,*self.args,**self.kwds)
>>  File "/usr/lib/python2.5/site-packages/scipy/stats/distributions.py", line 625, in cdf
>>    place(output,cond,self._cdf(*goodargs))
>>  File "/usr/lib/python2.5/site-packages/scipy/stats/distributions.py", line 528, in _cdf
>>    return self.veccdf(x,*args)
>>  File "/usr/lib/python2.5/site-packages/numpy/lib/function_base.py", line 1886, in __call__
>>    _res = array(self.ufunc(*newargs),copy=False,
>>  File "/usr/lib/python2.5/site-packages/scipy/stats/distributions.py", line 525, in _cdf_single_call
>>    return scipy.integrate.quad(self._pdf, self.a, x, args=args)[0]
>>  File "/usr/lib/python2.5/site-packages/scipy/integrate/quadpack.py", line 185, in quad
>>    retval = _quad(func,a,b,args,full_output,epsabs,epsrel,limit,points)
>>  File "/usr/lib/python2.5/site-packages/scipy/integrate/quadpack.py", line 249, in _quad
>>    return _quadpack._qagse(func,a,b,args,full_output,epsabs,epsrel,limit)
>>  File "/usr/lib/python2.5/site-packages/scipy/stats/distributions.py", line 3046, in _pdf
>>    return pow((1.0-x*x),c/2.0-1) / special.beta(0.5,c/2.0)
>> ZeroDivisionError: 0.0 cannot be raised to a negative power
>>
>> and my workaround doesn't work any more so I need to look for another
>> one.
>>
>>
>> On Tue, 09 Dec 2008, Yaroslav Halchenko wrote:
>>
>>> > * distributions that have problems for some range of parameters
>>> so a good (imho) piece to add to unittests for the 'issues' to be fixed:
>>
>>> scipy.stats.rdist(1.32, 0, 1).cdf(-1.0+numpy.finfo(float).eps)
>>
>>> (tried on the SVN trunk to verify that it fails... discover the reason on
>>> your own ;-))
>>
>>> For myself I resolved it with
>>
>>>     __eps = N.sqrt(N.finfo(float).eps)
>>>     rdist = rdist_gen(a=-1.0+__eps, b=1.0-__eps, ....
>>
>>> but I am not sure if that is the cleanest way... and may be some other
>>> distributions would need such tweakery to make them more stable.
>> --
>>                                  .-.
>> =------------------------------   /v\  ----------------------------=
>> Keep in touch                    // \\     (yoh@|www.)onerussian.com
>> Yaroslav Halchenko              /(   )\               ICQ#: 60653192
>>                   Linux User    ^^-^^    [175555]
>>
>>
>
> Fixing numerical integration over the distance of a machine epsilon of
> a function that has a singularity at the boundary was not very high on
> my priority list.
>
> If there is a real use case that requires this, I can do a temporary
> fix. As far as I have seen, you use explicitly this special case as a
> test case and not a test that would reflect a failing use case.
> Overall, I prefer to have a general solution to the boundary problem
> for numerical integration, instead of messing around with the
> theoretically correct boundaries.
>
> Also, I would like to know what the references for the rdist are.
> Google search for r distribution is pretty useless, and I have not yet
> found a reference or an explanation of the rdist and its uses.
>
> Josef
>

numerical inprecision for calculation close to the boundary with
>>> stats.rdist.a
-1
rdist is symmetric, so the following two should be the same:
>>> stats.rdist(1.32, 0, 1).cdf(-1.0+1e-12)
5.5125207421811825e-009
>>> 1-stats.rdist(1.32, 0, 1).cdf(1.0-1e-12)
1.8800738743607326e-011

If you want to avoid the exception for values very close to the
boundary, then you can override the boundary yourself.

>>> stats.rdist.a = -1 + sqrt(np.finfo(float).eps)
>>> sqrt(np.finfo(float).eps)
1.4901161193847656e-008

but then you loose numerical precision at the other points close to
the boundary. The zero values are obtained because the integral is
negative, the upper integration bound is smaller than the lower
integration bound, and I guess, the check for validity of values in
cdf sets out-of-bound (negative) values to zero.

>>> stats.rdist(1.32, 0, 1).cdf(-1.0+np.finfo(float).eps)
0.0
>>> stats.rdist(1.32, 0, 1).cdf(-1.0+1e-14)
0.0
>>> stats.rdist(1.32, 0, 1).cdf(-1.0+1e-12)
0.0
>>> stats.rdist(1.32, 0, 1).cdf(-1.0+1e-11)
0.0
>>> stats.rdist(1.32, 0, 1).cdf(-1.0+1e-10)
0.0
>>> stats.rdist(1.32, 0, 1).cdf(-1.0+1e-9)
0.0
>>> stats.rdist(1.32, 0, 1).cdf(-1.0+1e-8)
0.0
>>> stats.rdist(1.32, 0, 1).cdf(-1.0+1e-7)
7.8873658135486906e-006

compared with symmetric case
>>> stats.rdist(1.32, 0, 1).sf(1.0-1e-7)
1.8275658764110858e-010

I don't know about the internals of scipy.integrate.quad and how it
handles boundary points.
For anything closer to the boundary than 1e-10 the estimated absolute
error is larger than the
integral. Closer to the boundary than 1e-14 raises an exception.

>>> scipy.integrate.quad(stats.rdist._pdf, stats.rdist.a, -1+1e-8, args=(1.32,))
(2.4122318926575886e-006, 1.984389088393137e-012)
>>> scipy.integrate.quad(stats.rdist._pdf, stats.rdist.a, -1+1e-9, args=(1.32,))
(5.2774056254715333e-007, 5.6048532675329467e-012)
>>> scipy.integrate.quad(stats.rdist._pdf, stats.rdist.a, -1+1e-10, args=(1.32,))
(1.152923988934946e-007, 1.3805352057819491e-008)
>>> scipy.integrate.quad(stats.rdist._pdf, stats.rdist.a, -1+1e-11, args=(1.32,))
(2.5201736781677286e-008, 4.7714144348752183e-009)

this looks ok:

>>> stats.rdist.a = -1 + np.finfo(float).eps
>>> scipy.integrate.quad(stats.rdist._pdf, stats.rdist.a, -1+1e-7, args=(1.32,))
(1.1026024914000647e-005, 2.4375839668006874e-012)
>>> scipy.integrate.quad(stats.rdist._pdf, stats.rdist.a, -1+1e-14, args=(1.32,))
(2.4315077563114524e-010, 3.8495287806628835e-012)
>>> stats.rdist(1.32, 0, 1).cdf(-1.0+np.finfo(float).eps)
0.0
>>> stats.rdist(1.32, 0, 1).cdf(-1.0+2*np.finfo(float).eps)
Warning: Extremely bad integrand behavior occurs at some points of the
  integration interval.
1.2660374321273558e-011
>>> stats.rdist(1.32, 0, 1).cdf(-1.0+3*np.finfo(float).eps)
Warning: Extremely bad integrand behavior occurs at some points of the
  integration interval.
2.2899449003632814e-011
>>> stats.rdist(1.32, 0, 1).cdf(-1.0+1e-13)
1.1878387051893005e-009
>>> stats.rdist(1.32, 0, 1).cdf(-1.0+1e-10)
1.1528892986424152e-007

So, your solution for increasing the lower bound removes the exception
but changes (screws up) the next range of values. But, introducing
integration bounds different from dist.a and dist.b might work, but it
requires testing to see if it works for all distributions, and we
don't have test for all epsilon boundary cases.

Josef