[Numpy-discussion] Recarray sort() slowness

Sun Mar 4 14:30:03 EST 2007

El dg 04 de 03 del 2007 a les 08:54 -0700, en/na Charles R Harris va
escriure:
> 
> 
> On 3/4/07, Charles R Harris <charlesr.harris at gmail.com> wrote:
>         
>         
>         On 3/4/07, Francesc Altet <faltet at carabos.com> wrote:
>                 Hi,
>                 
>                 I've finally implemented Chuck's suggestion of sorting
>                 of a recarray of
>                 two fields (the first being the actual array to be
>                 sorted and the other
>                 being the array to be reordered following the
>                 resulting order for the 
>                 first one). Indeed, this approach saves me an amount
>                 of memory
>                 equivalent to the size of the second column, which is
>                 really nice.
>                 
>                 However, I'm afraid that I wouldn't be able to use
>                 this approach as it 
>                 is 25x slower (see the attached benchmark; beware!
>                 only runs on a Linux
>                 kernel 2.6!) than regular argsorting of the first
>                 field and then doing a
>                 fancy indexing over the second. If the slowdown would
>                 be 2x I still can 
>                 have a chance to use it, but 25x is a no go.
>                 
>                 I'm curious why the recarray.sort(order='fieldN') is
>                 so slow, and I'm
>                 wondering if this can be speed-up in some way or
>                 another.
>         
>         I suspect there are several reasons.
>         
>         1) It defines a new type with the comparison done on all
>         fields
>         2) exchanges are done by copying the specified number of bytes
>         
>         I think Travis was clever to define a new type, it made things
>         easy and very general, but it wasn't aimed at speed. There
>         might be some optimizations possible in there that Travis
>         could speak to. 
>         
>         It would be pretty easy to modify argsort itself to do what
>         you want in a type specific way using a key vector and a
>         vector to be sorted by the keys. I expect it would be about
>         1/2 as fast as the normal argsort. Hmmm, something like
>         keysort(...).
> 
> To expand a bit, argsort sorts a vector of indices on the keys. IIRC,
> it doesn't exchange the keys (a tradeoff between exchange overhead and
> cache locality), so that would have to be changed, and instead of
> passing in an array of indices you would pass in the array you want to
> sort. 

Yes, I think I got the point. Something like:

numpy.keysort(vector1, vector2)

that would sort vector1 in-place and also reorder vector2 but following
vector1 order at the same time. Of course, different dtypes for vector1
and vector2 should be supported (although the shape should be the same).
It would be a nice thing to have in NumPy indeed (and would fit
perfectly to my needs :).

Cheers,

> 
-- 
Francesc Altet    |  Be careful about using the following code --
Carabos Coop. V.  |  I've only proven that it works, 
www.carabos.com   |  I haven't tested it. -- Donald Knuth