[SciPy-User] bug in rankdata?

Sat Feb 16 13:22:57 EST 2013

On 2/15/13, Chris Rodgers <xrodgers at gmail.com> wrote:
> Thanks very much! I discovered this bug because Mann-Whitney U was
> giving me bizarre results, like a negative U statistic. My data is a
> large number of integer counts, mostly zeros, which is the worst case
> for ties.
>
> Until I can update scipy, I'll either write my own rankdata method,
> which will be very slow, or I'll use the R equivalent which is more
> feature-ful (but then I have to figure out rpy2 which will also be
> slow).
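
A hand-written replacement does not have to be slow if it is vectorized
with NumPy. The following is only a sketch of average-rank tie handling,
not the scipy implementation; the function name average_ranks is made up
for illustration, and it assumes 1-D numeric input with no NaNs.

```python
import numpy as np

def average_ranks(values):
    """Rank 1-D data, giving tied values the mean of the ranks they span."""
    a = np.asarray(values).ravel()
    # inverse[i] is the index of a[i] among the sorted distinct values
    _, inverse = np.unique(a, return_inverse=True)
    inverse = inverse.ravel()
    counts = np.bincount(inverse)
    # 1-based rank of the last and first member of each tie group,
    # accumulated in 64-bit integers so large inputs do not overflow
    last = np.cumsum(counts, dtype=np.int64)
    first = last - counts + 1
    return ((first + last) / 2.0)[inverse]

print(average_ranks([0, 2, 2, 3]))
print(average_ranks(np.ones(1000000, dtype=np.int64))[:3])
```

For all-equal input of length n this gives (n + 1) / 2 for every element,
which is the value expected in the original report.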

You could also try pandas (http://pandas.pydata.org/).  The DataFrame
and Series classes have a 'rank' method
(http://pandas.pydata.org/pandas-docs/stable/computation.html#data-ranking).
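
As a minimal sketch of that suggestion (the variable names are
illustrative, and the input mirrors the all-ones test case from the
report), ranking with ties averaged looks roughly like this:

```python
import numpy as np
import pandas as pd

# All-equal data, the worst case for ties
counts = pd.Series(np.ones(100000, dtype=np.int64))

# method="average" assigns each tied value the mean of the ranks it spans
ranks = counts.rank(method="average")

print(ranks.iloc[0])  # 50000.5, i.e. len(data) / 2 + 0.5
```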

Warren

>
> On Fri, Feb 15, 2013 at 10:22 AM, Warren Weckesser
> <warren.weckesser at gmail.com> wrote:
>>
>>
>> On Fri, Feb 15, 2013 at 10:32 AM, Warren Weckesser
>> <warren.weckesser at gmail.com> wrote:
>>>
>>> On 2/14/13, Chris Rodgers <xrodgers at gmail.com> wrote:
>>> > The results I'm getting from rankdata seem completely wrong for large
>>> > datasets. I'll illustrate with a case where all data are equal, so
>>> > every rank should be len(data) / 2 + 0.5.
>>> >
>>> > In [220]: rankdata(np.ones((10000,), dtype=np.int))
>>> > Out[220]: array([ 5000.5,  5000.5,  5000.5, ...,  5000.5,  5000.5,
>>> > 5000.5])
>>> >
>>> > In [221]: rankdata(np.ones((100000,), dtype=np.int))
>>> > Out[221]:
>>> > array([ 7050.82704,  7050.82704,  7050.82704, ...,  7050.82704,
>>> >         7050.82704,  7050.82704])
>>> >
>>> > In [222]: rankdata(np.ones((1000000,), dtype=np.int))
>>> > Out[222]:
>>> > array([ 1784.293664,  1784.293664,  1784.293664, ...,  1784.293664,
>>> >         1784.293664,  1784.293664])
>>> >
>>> > In [223]: scipy.__version__
>>> > Out[223]: '0.11.0'
>>> >
>>> > In [224]: numpy.__version__
>>> > Out[224]: '1.6.1'
>>> >
>>> >
>>> > The results are completely off for N>10000 or so. Am I doing something
>>> > wrong?
>>>
>>>
>>> Looks like a bug.  The code that accumulates the ranks of the tied
>>> values is using a 32 bit integer for the sum of the ranks, and this is
>>> overflowing.  I'll see if I can get this fixed for the imminent
>>> release of 0.12.
>>>
>>> Warren
>>>
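
The reported numbers are consistent with that diagnosis. The sketch below
is a model of the failure, not the scipy source: it assumes the sum of the
tied ranks 1..n wraps like a signed two's-complement 32-bit integer before
being divided by n. The helper names are made up for illustration.

```python
def wrap_int32(value):
    """Reduce a Python int to what a signed 32-bit integer would hold."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value

def overflowed_tie_rank(n):
    """Average rank of n tied values if the rank sum wraps at 32 bits."""
    rank_sum = n * (n + 1) // 2  # exact sum of ranks 1..n
    return wrap_int32(rank_sum) / n

for n in (10000, 100000, 1000000):
    # columns: n, correct average rank, rank after 32-bit wraparound
    print(n, (n + 1) / 2, overflowed_tie_rank(n))
```

Under this model the three sizes give approximately 5000.5, 7050.82704 and
1784.293664, matching the output in the report. For some sizes the wrapped
sum is negative, which would explain a negative U statistic downstream.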
>>
>>
>> A pull request with the fix is here:
>> https://github.com/scipy/scipy/pull/436
>>
>>
>> Warren
>>
>>
>>>
>>> > _______________________________________________
>>> > SciPy-User mailing list
>>> > SciPy-User at scipy.org
>>> > http://mail.scipy.org/mailman/listinfo/scipy-user
>>> >