[Numpy-discussion] argsort speed

Julian Taylor jtaylor.debian at googlemail.com
Mon Feb 17 13:32:18 EST 2014


On 17.02.2014 15:18, Francesc Alted wrote:
> On 2/17/14, 1:08 AM, josef.pktd at gmail.com wrote:
>> On Sun, Feb 16, 2014 at 6:12 PM, Daπid <davidmenhur at gmail.com> wrote:
>>> On 16 February 2014 23:43, <josef.pktd at gmail.com> wrote:
>>>> What's the fastest argsort for a 1d array with around 28 Million
>>>> elements, roughly uniformly distributed, random order?
>>>
>>> On numpy latest version:
>>>
>>> for kind in ['quicksort', 'mergesort', 'heapsort']:
>>>      print kind
>>>      %timeit np.sort(data, kind=kind)
>>>      %timeit np.argsort(data, kind=kind)
>>>
>>>
>>> quicksort
>>> 1 loops, best of 3: 3.55 s per loop
>>> 1 loops, best of 3: 10.3 s per loop
>>> mergesort
>>> 1 loops, best of 3: 4.84 s per loop
>>> 1 loops, best of 3: 9.49 s per loop
>>> heapsort
>>> 1 loops, best of 3: 12.1 s per loop
>>> 1 loops, best of 3: 39.3 s per loop
>>>
>>>
>>> It looks like quicksort is quicker at sorting, but mergesort is marginally
>>> faster at sorting args. The difference is slim, but upon repetition it
>>> remains significant.
>>>
>>> Why is that? Probably part of the reason is what Eelco said, and part is
>>> that in sort comparisons are done by accessing the array elements directly,
>>> while in argsort you have to index the array, introducing some overhead.
>> Thanks, both.
>>
>> I also gain a second with mergesort.
>>
>> MATLAB would be nicer in my case, since it returns both.
>> I still need to use the argsort result to index into the array in order to
>> also get the sorted array.
> 
> Many years ago I needed something similar, so I made some functions for
> sorting and argsorting in a single shot.  Maybe you want to reuse them.
> Here is an example of the C implementation:
> 
> https://github.com/PyTables/PyTables/blob/develop/src/idx-opt.c#L619
> 
> and here is the Cython wrapper for all of them:
> 
> https://github.com/PyTables/PyTables/blob/develop/tables/indexesextension.pyx#L129
> 
> Francesc
> 
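
For reference, a rough pure-numpy analogue of that sort-and-argsort-in-one-shot
idea would be to sort the values and their original positions together in a
structured array (just a sketch, not the PyTables code):

import numpy as np

a = np.random.rand(1000000)

# pack the values and their original positions into one record array,
# so that a single stable sort on the value field orders both together
rec = np.empty(len(a), dtype=[('val', a.dtype), ('idx', np.intp)])
rec['val'] = a
rec['idx'] = np.arange(len(a))
rec.sort(order='val', kind='mergesort')     # one sorting pass

sorted_vals = rec['val']    # equivalent to np.sort(a)
order = rec['idx']          # equivalent to np.argsort(a, kind='mergesort')

The extra copy into the record array typically offsets much of the gain, though.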

Either way, that doesn't really make a big difference if the data is randomly
distributed: the sorting operation is normally much more expensive than
applying the indices afterwards:

In [1]: d = np.arange(10000000)

In [2]: np.random.shuffle(d)

In [3]: %timeit  np.argsort(d)
1 loops, best of 3: 1.99 s per loop

In [4]: idx = np.argsort(d)

In [5]: %timeit d[idx]
1 loops, best of 3: 213 ms per loop
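
So to get both results in one go, MATLAB-style, it is usually enough to call
argsort once and reuse the indices (a minimal sketch; the array size is chosen
to roughly match the 28 million elements mentioned above):

import numpy as np

data = np.random.permutation(28 * 10**6)

idx = np.argsort(data, kind='mergesort')   # the expensive part
sorted_data = data[idx]                    # cheap fancy-indexing pass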



But if your data is not random it can make a difference, as even quicksort
can be a lot faster in that case.
Timsort would be a nice addition to numpy; it performs very well on partially
sorted data. Unfortunately it's quite complicated to implement.
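
To illustrate the partially sorted case, here is a quick benchmark sketch (the
1% shuffled tail is an arbitrary choice and the numbers will vary by machine):

import numpy as np
import timeit

n = 10**7
random_data = np.random.permutation(n)

# mostly ordered data with a small shuffled tail
partial = np.arange(n)
tail = slice(n - n // 100, n)
partial[tail] = np.random.permutation(partial[tail])

for data, label in [(random_data, 'random'), (partial, 'partially sorted')]:
    for kind in ('quicksort', 'mergesort', 'heapsort'):
        t = timeit.timeit(lambda: np.sort(data, kind=kind), number=1)
        print('%-16s %-10s %.2f s' % (label, kind, t))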


