[SciPy-dev] Sad sad sad... Was: Warning about remaining issues in stats.distributions ?

Fri Mar 13 12:12:45 EDT 2009

On Fri, Mar 13, 2009 at 10:10 AM, Yaroslav Halchenko
<lists at onerussian.com> wrote:
>
>> Fixing numerical integration over the distance of a machine epsilon of
>> a function that has a singularity at the boundary was not very high on
>> my priority list.
> fair enough, but still sad ;)
> it is just that I got frustrated since upgrade to 0.7.0 caused quite a
> few places to break, and this one was one of the 'soar' ones ;)

I ran the pymvpa test suite several times with uptodate versions of scipy
and I usually didn't see any problems.
I expected that, all my changes in stats so far should be backwards
compatible, at least for parts that worked correctly before. If there are
any problems you could be more specific and report them on the
mailing list or open a ticket.

>
>> If there is a real use case that requires this, I can do a temporary
> well... I think I've mentioned before how I ran into this issue: our
> unittests of PyMVPA [1] fail.  Primarily (I think) to my silly
> distribution matching function.
>
>> fix. As far as I have seen, you use explicitly this special case as a
>> test case and not a test that would reflect a failing use case.
> well -- this test case is just an example. that distribution matching is
> the one which causes it in unittests

I only found the test for the rdist in test_stats.testRDistStability, and
there it explicitely checks this corner case.

In general the current implementation of the fit method still has several
major problems to work out of the box for all distributions, especially
those with a finite support boundary. Generic starting values
don't work with many distributions, and fitting the location for distributions
with finite support bounds using maximum likelihood often has problems.

In pymvpa you also allow fitting of semifrozen distributions, is this your
default in the use of MatchDistributions for distributions with bounded
support as rdist, or half-open support. In your match distribution
example I get many distributions with very bad kstest statistic, and when
I tried out possible changes to automatically match distributions with fit,
then this often indicated estimation problems and not necessarily a
bad fit. But I did this for only a few sample distributions.

For the largest part, I like the statistical analysis in pymvpa and you are
far ahead of scipy.stats, eg. MatchDistribution and rv_semifrozen, or
many of the other statistical methods. But in many cases, I don't like
the actual implementation so much at least for a general statistical
use.

If you have an example where matching the rdist distribution actually
raises your exception, then we could look at the fitting method and
see whether we can make it more robust. I still don't see how
any statistical method can hit boundary+eps. For rdist solution see
below.

>
>> Overall, I prefer to have a general solution to the boundary problem
>> for numerical integration, instead of messing around with the
>> theoretically correct boundaries.
> sure! proper solution would be nice.  as for "messing with boundaries":
> imho it depends on how 'they are messed up with" ;) May be original
> self.{a,b} could be left alone, but for numeric integration some others
> could be introduced (self.{a,b}_eps), which are used for integration and
> some correction term to be added whenever we are in the 'theoretical'
> boundaries, to compensate for "missing" part of [self.a, self.a_eps]
>
> Most of the distributions would not need to have them different from a,b
> and have 0 correction.
> Testing is imho quite obvious -- just go through all distributions and
> try to obtain .cdf values within sample points in the vicinity of the
> boundaries. I know that it is know exhaustive if a distribution has
> singularities within, but well -- it is better than nothing imho ;)
>
> I bet you can come up with a better solution.
>
>> Also, I would like to know what the references for the rdist are.
> actually I've found empirically that rdist is the one which I needed,
> and indeed there is not much information on the web:
>
> rdist corresponds to the distribution of a (single) coordinate
> component for a point located on a hypersphere (in space of N
> dimensions) of radius 1. When  N is large it is well approximated by
> Gaussian, but in the low dimensions it is very
> different and quite interesting (e.g. flat in N=3)
>
> n.b. actually my boss told me that there is a family of distributions where
> this one belongs to but I've forgotten which one ;) will ask today again
>
>> Google search for r distribution is pretty useless, and I have not yet
>> found a reference or an explanation of the rdist and its uses.
> there was just a single page which I ran to which described rdist and
> plotted sample pdfs. but can't find it now

I read somewhere, I don't remember where that rdist is the distribution
of the correlation coefficient, but without more information that's pretty
useless

>
> [1] http://www.pymvpa.org/
> --

The solution to the rdist problem is trivial:

>>> np.power(1-1.,-2)
inf
>>> pow(1-1.,-2)
Traceback (most recent call last):
  File "<pyshell#163>", line 1, in <module>
    pow(1-1.,-2)
ZeroDivisionError: 0.0 cannot be raised to a negative power

I hate missing name spaces, I didn't know that pow is the python
buildin and not numpy.power
numpy.power can raise 0.0 to the negative power, I just have to remove
the usage or python's pow.
Much better than fiddling with boundaries. I still have to test it,
but this looks ok.

I was wondering why you used the square root in your eps increase on
the boundary, because in all my trying in the python shell it seems to
work also without. But the problem is that I was using a numpy float
instead of a python float and so numpy.power was used instead of
python pow, (I assume):

>>> (-1+(1-(1e-14)**2))**(-2)
Traceback (most recent call last):
  File "<pyshell#170>", line 1, in <module>
    (-1+(1-(1e-14)**2))**(-2)
ZeroDivisionError: 0.0 cannot be raised to a negative power
>>> (-1+(1-np.finfo(float).eps**2))**(-1)
inf

>>> (-1+(1-(1e-14)**2)).__class__
<type 'float'>
>>> (-1+(1-np.finfo(float).eps**2)).__class__
<type 'numpy.float64'>

Josef