[Numpy-discussion] argsort speed
Charles R Harris
charlesr.harris at gmail.com
Mon Feb 17 13:40:50 EST 2014
On Mon, Feb 17, 2014 at 11:32 AM, Julian Taylor <
jtaylor.debian at googlemail.com> wrote:
> On 17.02.2014 15:18, Francesc Alted wrote:
> > On 2/17/14, 1:08 AM, josef.pktd at gmail.com wrote:
> >> On Sun, Feb 16, 2014 at 6:12 PM, Daπid <davidmenhur at gmail.com> wrote:
> >>> On 16 February 2014 23:43, <josef.pktd at gmail.com> wrote:
> >>>> What's the fastest argsort for a 1d array with around 28 Million
> >>>> elements, roughly uniformly distributed, random order?
> >>>
> >>> On numpy latest version:
> >>>
> >>> for kind in ['quicksort', 'mergesort', 'heapsort']:
> >>> print kind
> >>> %timeit np.sort(data, kind=kind)
> >>> %timeit np.argsort(data, kind=kind)
> >>>
> >>>
> >>> quicksort
> >>> 1 loops, best of 3: 3.55 s per loop
> >>> 1 loops, best of 3: 10.3 s per loop
> >>> mergesort
> >>> 1 loops, best of 3: 4.84 s per loop
> >>> 1 loops, best of 3: 9.49 s per loop
> >>> heapsort
> >>> 1 loops, best of 3: 12.1 s per loop
> >>> 1 loops, best of 3: 39.3 s per loop
> >>>
> >>>
> >>> It looks quicksort is quicker sorting, but mergesort is marginally
> faster
> >>> sorting args. The diference is slim, but upon repetition, it remains
> >>> significant.
> >>>
> >>> Why is that? Probably part of the reason is what Eelco said, and part
> is
> >>> that in sort comparison are done accessing the array elements
> directly, but
> >>> in argsort you have to index the array, introducing some overhead.
> >> Thanks, both.
> >>
> >> I also gain a second with mergesort.
> >>
> >> matlab would be nicer in my case, it returns both.
> >> I still need to use the argsort to index into the array to also get
> >> the sorted array.
> >
> > Many years ago I needed something similar, so I made some functions for
> > sorting and argsorting in one single shot. Maybe you want to reuse
> > them. Here it is an example of the C implementation:
> >
> > https://github.com/PyTables/PyTables/blob/develop/src/idx-opt.c#L619
> >
> > and here the Cython wrapper for all of them:
> >
> >
> https://github.com/PyTables/PyTables/blob/develop/tables/indexesextension.pyx#L129
> >
> > Francesc
> >
>
> that doesn't really make a big difference if the data is randomly
> distributed.
> the sorting operation is normally much more expensive than latter
> applying the indices:
>
> In [1]: d = np.arange(10000000)
>
> In [2]: np.random.shuffle(d)
>
> In [3]: %timeit np.argsort(d)
> 1 loops, best of 3: 1.99 s per loop
>
> In [4]: idx = np.argsort(d)
>
> In [5]: %timeit d[idx]
> 1 loops, best of 3: 213 ms per loop
>
>
>
> But if your data is not random it can make a difference as even
> quicksort can be a lot faster then.
> timsort would be a nice addition to numpy, it performs very well for
> partially sorted data. Unfortunately its quite complicated to implement.
>
Quicksort and shellsort gain speed by having simple inner loops. I have the
impression that timsort is optimal when compares and memory access are
expensive, but I haven't seen any benchmarks for native types in contiguous
memory.
Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20140217/31bdceff/attachment.html>
More information about the NumPy-Discussion
mailing list