[SciPy-User] bug in rankdata?

Warren Weckesser warren.weckesser at gmail.com
Sat Feb 16 13:44:48 EST 2013


On 2/16/13, Warren Weckesser <warren.weckesser at gmail.com> wrote:
> On 2/15/13, Chris Rodgers <xrodgers at gmail.com> wrote:
>> Thanks very much! I discovered this bug because mann-whitney U was
>> giving me bizarre results, like a negative U statistic. My data is a
>> large number of integer counts, mostly zeros, which is the worst case
>> for ties.
>>
>> Until I can update scipy, I'll either write my own rankdata method,
>> which will be very slow, or I'll use the R equivalent which is more
>> feature-ful (but then I have to figure out rpy2 which will also be
>> slow).
>
>
> You could also try pandas (http://pandas.pydata.org/).  The DataFrame
> and Series classes have a 'rank' method
> (http://pandas.pydata.org/pandas-docs/stable/computation.html#data-ranking).
>
> Warren


P.S. Since rankdata was rewritten in cython, the comment in the pandas
documentation about the pandas rank function being 10 to 20 times
faster than scipy.stats.rankdata is no longer true.  The pandas rank
function does provide more options for how the tied ranks are
assigned, which might be useful for you.

Warren


>
>
>>
>> On Fri, Feb 15, 2013 at 10:22 AM, Warren Weckesser
>> <warren.weckesser at gmail.com> wrote:
>>>
>>>
>>> On Fri, Feb 15, 2013 at 10:32 AM, Warren Weckesser
>>> <warren.weckesser at gmail.com> wrote:
>>>>
>>>> On 2/14/13, Chris Rodgers <xrodgers at gmail.com> wrote:
>>>> > The results I'm getting from rankdata seem completely wrong for large
>>>> > datasets. I'll illustrate with a case where all data are equal, so
>>>> > every rank should be len(data) / 2 + 0.5.
>>>> >
>>>> > In [220]: rankdata(np.ones((10000,), dtype=np.int))
>>>> > Out[220]: array([ 5000.5,  5000.5,  5000.5, ...,  5000.5,  5000.5,
>>>> > 5000.5])
>>>> >
>>>> > In [221]: rankdata(np.ones((100000,), dtype=np.int))
>>>> > Out[221]:
>>>> > array([ 7050.82704,  7050.82704,  7050.82704, ...,  7050.82704,
>>>> >         7050.82704,  7050.82704])
>>>> >
>>>> > In [222]: rankdata(np.ones((1000000,), dtype=np.int))
>>>> > Out[222]:
>>>> > array([ 1784.293664,  1784.293664,  1784.293664, ...,  1784.293664,
>>>> >         1784.293664,  1784.293664])
>>>> >
>>>> > In [223]: scipy.__version__
>>>> > Out[223]: '0.11.0'
>>>> >
>>>> > In [224]: numpy.__version__
>>>> > Out[224]: '1.6.1'
>>>> >
>>>> >
>>>> > The results are completely off for N>10000 or so. Am I doing
>>>> > something
>>>> > wrong?
>>>>
>>>>
>>>> Looks like a bug.  The code that accumulates the ranks of the tied
>>>> values is using a 32 bit integer for the sum of the ranks, and this is
>>>> overflowing.  I'll see if I can get this fixed for the imminent
>>>> release of 0.12.
>>>>
>>>> Warren
>>>>
>>>
>>>
>>> A pull  request with the fix is here:
>>> https://github.com/scipy/scipy/pull/436
>>>
>>>
>>> Warren
>>>
>>>
>>>>
>>>> > _______________________________________________
>>>> > SciPy-User mailing list
>>>> > SciPy-User at scipy.org
>>>> > http://mail.scipy.org/mailman/listinfo/scipy-user
>>>> >
>>>
>>>
>>>
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
>



More information about the SciPy-User mailing list