[SciPy-User] Off by one bug in Scipy.stats.hypergeom

Sat Jul 31 22:27:30 EDT 2010

On Sat, Jul 31, 2010 at 8:48 PM, Jacob Biesinger
<jake.biesinger at gmail.com> wrote:
> Hi!
> Perhaps I'm using the module incorrectly, but it looks like the x parameter
> in scipy.stats.hypergeom is off by one.  Specifically, I think it's
> one-too-high.
> From the wikipedia article
> http://en.wikipedia.org/wiki/Hypergeometric_distribution#Application_and_example (I
> know they could be wrong-- just hear me out on this),
> scipy.stats.hypergeom?
> Hypergeometric distribution
>
>        Models drawing objects from a bin.
>        M is total number of objects, n is total number of Type I objects.
>        RV counts number of Type I objects in N drawn without replacement
> from
>        population.
> So translating wikipedia's example...
> Pr(x=4; M=50, n=5, N=10)  = (choose(5,4) * choose(50-5, 10-4)) /
> choose(50,10) = .003964583
> Pr(x=5; M=50, n=5, N=10)  = (choose(5,5) * choose(50-5, 10-5)) /
> choose(50,10) = .0001189375
> Which you can check with the python code:
> from scipy import comb as chse   # "combination" => choose
>
> float((chse(5,4, exact=1) * chse(50-5,10-4, exact=1))) / chse(50,10,exact=1)
>  # example one
> 0.0039645830580150654155  # okay!
> float((chse(5,5, exact=1) * chse(50-5,10-5, exact=1))) / chse(50,10,exact=1)
> # example two
> 0.00011893749174045196247  # okay!
> Try example one with scipy.stats.hypergeom:
> # scipy.stats.hypergeom.sf(x, M, n, N)
> scipy.stats.hypergeom.sf(4,50,5,10)
> 0.00011893749169422652     # correct value for x=5, not x=4
> scipy.stats.hypergeom.sf(5,50,5,10)
> -4.6185277824406512e-14    # wrong
> It seems that changing the loc value from =0 (default) to =1 fixes the
> issue...
> scipy.stats.hypergeom.sf(4,50,5,10, loc=1)
> 0.0040835205497095073    # close enough
> scipy.stats.hypergeom.sf(5,50,5,10, loc=1)
> 0.00011893749169422652   # okay!
> Am I using the package wrong?

I don't know why you are using the survival function.

>From some quick examples, hypergeom looks ok:

pmf has identical results to the Wikipedia example you referenced

>>> stats.hypergeom.pmf([4,5],50,5,10)
array([ 0.00396458,  0.00011894])
>>> stats.hypergeom.pmf(4,50,5,10)
0.0039645830580151411
>>> stats.hypergeom.pmf(5,50,5,10)
0.00011893749174045286

consistency of pmf, cdf, sf

>>> stats.hypergeom.pmf(np.arange(5),50,5,10).sum()
0.9998810625082829
>>> stats.hypergeom.cdf(4,50,5,10)
0.9998810625082829
>>> stats.hypergeom.sf(4,50,5,10)
0.00011893749171709711
>>> 1-stats.hypergeom.pmf(np.arange(5),50,5,10).sum()
0.00011893749171709711

I'm always glad when someone verifies the numbers in scipy stats and
appreciate any reports of inconsistencies or bugs. If there are
differences in the parameterization, then we should make sure that
they are sufficiently documented.
Most distributions are internally tested or against numpy.random and
there are very few tests with other packages.

Thanks,

Josef

> --
> Jake Biesinger
> Graduate Student
> Xie Lab, UC Irvine
> (949) 231-7587
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>