[Numpy-discussion] bug in stats.randint

Thu Apr 23 12:56:20 EDT 2009

On Thu, Apr 23, 2009 at 11:18 AM, Flavio Coelho <fccoelho at gmail.com> wrote:
>
>
> On Thu, Apr 23, 2009 at 2:56 PM, <josef.pktd at gmail.com> wrote:
>>
>> On Thu, Apr 23, 2009 at 9:27 AM, Flavio Coelho <fccoelho at gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I stumbled upon something I think is a bug in scipy:
>> >
>> > In [4]: stats.randint(1.,15.).ppf([.1,
>> > .2,.3,.4,.5])
>> > Out[4]: array([ 2.,  3.,  5.,  6.,  7.])
>> >
>> > When you pass float arguments to stats.randint and then call the ppf
>> > method,
>> > you get an array of floats, which clearly wrong. The rvs method doesn't
>> > display this bug so I think is a matter of poor type checking in the ppf
>> > implementation...
>> >
>>
>> I switched to using floats intentionally, to have correct handling of
>> inf and nans. and the argument checking is generic for all discrete
>> distributions and not special cased. Nans are converted to zero when
>> casting to integers, which is wrong and very confusing. inf raise an
>> exception. I prefer correct numbers to correct types. see examples
>> below
>
> I understand. I couldn't find, however, any parameterizations which makes
> randint return INF or NAN. So it should be safe to make randint return
> integers.
>
> I tested the equivalent function (for Poisson) in R (qpois) and it also
> returns INF when quantile is 1. so no problem there.
>
> I found some weird behaviors with stats.poisson.ppf:
>
> In [34]: stats.poisson.ppf([0,1])
> Out[34]: array([ -1.,  Inf]) #R returns [0,inf]
> In [35]: stats.poisson.ppf([0.,.1])
> ---------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
>
> /home/fccoelho/<ipython console> in <module>()
>
> /usr/lib/python2.6/dist-packages/scipy/stats/distributions.pyc in ppf(self,
> q, *args, **kwds)
>    4002             goodargs = argsreduce(cond, *((q,)+args+(loc,)))
>    4003             loc, goodargs = goodargs[-1], goodargs[:-1]
> -> 4004             place(output,cond,self._ppf(*goodargs) + loc)
>    4005
>    4006         if output.ndim == 0:
>
> TypeError: _ppf() takes exactly 3 arguments (2 given)

poisson has one required paramater for the distribution

>>> print stats.poisson.extradoc
Poisson distribution
poisson.pmf(k, mu) = exp(-mu) * mu**k / k!
for k >= 0
>>> stats.poisson.numargs   # number of required distribution parameters
1

so raising an exception in stats.poisson.ppf([0.,.1]) is correct since
you didn't specify a parameter
stats.poisson.ppf([0.,.1],1) works

stats.poisson.ppf([0,1]) is getting the boundary treatment and is
never checked for valid args
However, the difference for the two cases is strange, and I need to
look closer at this.

stats.poisson.ppf([0,0.5,1])  raises an exception as expected.

The ppf at 0 was defined as "lower bound minus one" in Travis' pdf
files, which is consistent with the definition of the ppf
>>> stats.poisson.ppf(0,1)
-1.0

It always takes me a while to figure out how the inequality in ppf are
supposed to go, the best is to draw a step function and check the
inverse mapping.

>>> k=[-1,0,1,2]; stats.poisson.ppf(stats.poisson.cdf(k,1),1)
array([-1.,  0.,  1.,  2.])
>>> stats.poisson.ppf(stats.poisson.cdf(k,1)-1e-5,1)
array([ NaN,   0.,   1.,   2.])
>>> stats.poisson.ppf(stats.poisson.cdf(k,1)+1e-5,1)
array([ 0.,  1.,  2.,  3.])

If q are quantiles, then a half open interval, of q, open left, closed
right, maps to the same integer k.
so q=0 does not belong to the interval that maps to k=0.
This might be a convention, but is required for consistency in some
cases, which I don't remember.
I think defining stats.poisson.ppf(0,1)=0 violates some inequalities
and would need special casing.

>
> I hope these bug reports help.
> I think handling infinity and zero is almost problematic, maybe the best we
> can do is to aim for predictable behavior (possibly matching the behavior of
> equivalent R functions)
>
>
>
>
>
> I think that if randint would return ints no matter what the type of the
> parameters, it would make the behavior of my code a lot more predictable.
>>
>>
>> If you don't have inf and nans you can cast them to int yourself.

you just need to add "int(result)" for scalar or "result.astype(int)"
for array returns if you want to have ints.
But it is your responsibility to check that your code doesn't break
with nan, if you have an invalid input.
e.g.
>>> stats.poisson.ppf(0.5,-1)
nan

I had checked with R, when I made the type change, and what I remember
is that, for most distributions, R return floats for ppf, but not in
all packages or for all distributions.

>>
>> Josef
>>
>> >>> aint = np.zeros(5,dtype=int)
>> >>> aint[0]= np.nan
>> >>> aint
>> array([0, 0, 0, 0, 0])
>> >>> aint[1]= np.inf
>> Traceback (most recent call last):
>>  File "<pyshell#134>", line 1, in <module>
>> OverflowError: cannot convert float infinity to long
>>
>> >>> from scipy import stats
>> >>> stats.poisson.ppf(1,1)
>> inf
>> >>> stats.poisson.ppf(2,1)
>> nan
>>
>> >>> stats.poisson.ppf(1,1).astype(int)
>> -2147483648
>> >>> aint[2] = stats.poisson.ppf(1,1)
>> Traceback (most recent call last):
>>  File "<pyshell#140>", line 1, in <module>
>> OverflowError: cannot convert float infinity to long