[SciPy-User] bug in rankdata?

Wes McKinney wesmckinn at gmail.com
Sat Feb 16 13:59:33 EST 2013


On Sat, Feb 16, 2013 at 1:44 PM, Warren Weckesser
<warren.weckesser at gmail.com> wrote:
> On 2/16/13, Warren Weckesser <warren.weckesser at gmail.com> wrote:
>> On 2/15/13, Chris Rodgers <xrodgers at gmail.com> wrote:
>>> Thanks very much! I discovered this bug because mann-whitney U was
>>> giving me bizarre results, like a negative U statistic. My data is a
>>> large number of integer counts, mostly zeros, which is the worst case
>>> for ties.
>>>
>>> Until I can update scipy, I'll either write my own rankdata method,
>>> which will be very slow, or I'll use the R equivalent which is more
>>> feature-ful (but then I have to figure out rpy2 which will also be
>>> slow).
>>
>>
>> You could also try pandas (http://pandas.pydata.org/).  The DataFrame
>> and Series classes have a 'rank' method
>> (http://pandas.pydata.org/pandas-docs/stable/computation.html#data-ranking).
>>
>> Warren
>
>
> P.S. Since rankdata was rewritten in cython, the comment in the pandas
> documentation about the pandas rank function being 10 to 20 times
> faster than scipy.stats.rankdata is no longer true.  The pandas rank
> function does provide more options for how the tied ranks are
> assigned, which might be useful for you.
>
> Warren
>

Oh, that's good to know. Will have to update the docs.

- Wes

>
>>
>>
>>>
>>> On Fri, Feb 15, 2013 at 10:22 AM, Warren Weckesser
>>> <warren.weckesser at gmail.com> wrote:
>>>>
>>>>
>>>> On Fri, Feb 15, 2013 at 10:32 AM, Warren Weckesser
>>>> <warren.weckesser at gmail.com> wrote:
>>>>>
>>>>> On 2/14/13, Chris Rodgers <xrodgers at gmail.com> wrote:
>>>>> > The results I'm getting from rankdata seem completely wrong for large
>>>>> > datasets. I'll illustrate with a case where all data are equal, so
>>>>> > every rank should be len(data) / 2 + 0.5.
>>>>> >
>>>>> > In [220]: rankdata(np.ones((10000,), dtype=np.int))
>>>>> > Out[220]: array([ 5000.5,  5000.5,  5000.5, ...,  5000.5,  5000.5,
>>>>> > 5000.5])
>>>>> >
>>>>> > In [221]: rankdata(np.ones((100000,), dtype=np.int))
>>>>> > Out[221]:
>>>>> > array([ 7050.82704,  7050.82704,  7050.82704, ...,  7050.82704,
>>>>> >         7050.82704,  7050.82704])
>>>>> >
>>>>> > In [222]: rankdata(np.ones((1000000,), dtype=np.int))
>>>>> > Out[222]:
>>>>> > array([ 1784.293664,  1784.293664,  1784.293664, ...,  1784.293664,
>>>>> >         1784.293664,  1784.293664])
>>>>> >
>>>>> > In [223]: scipy.__version__
>>>>> > Out[223]: '0.11.0'
>>>>> >
>>>>> > In [224]: numpy.__version__
>>>>> > Out[224]: '1.6.1'
>>>>> >
>>>>> >
>>>>> > The results are completely off for N>10000 or so. Am I doing
>>>>> > something
>>>>> > wrong?
>>>>>
>>>>>
>>>>> Looks like a bug.  The code that accumulates the ranks of the tied
>>>>> values is using a 32 bit integer for the sum of the ranks, and this is
>>>>> overflowing.  I'll see if I can get this fixed for the imminent
>>>>> release of 0.12.
>>>>>
>>>>> Warren
>>>>>
>>>>
>>>>
>>>> A pull  request with the fix is here:
>>>> https://github.com/scipy/scipy/pull/436
>>>>
>>>>
>>>> Warren
>>>>
>>>>
>>>>>
>>>>> > _______________________________________________
>>>>> > SciPy-User mailing list
>>>>> > SciPy-User at scipy.org
>>>>> > http://mail.scipy.org/mailman/listinfo/scipy-user
>>>>> >
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> SciPy-User mailing list
>>>> SciPy-User at scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>>
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user



More information about the SciPy-User mailing list