[Numpy-discussion] String sort

Charles R Harris charlesr.harris at gmail.com
Sat Feb 9 21:54:21 EST 2008


On Feb 9, 2008 3:19 PM, Francesc Altet <faltet at carabos.com> wrote:

> A Saturday 09 February 2008, Charles R Harris escrigué:
> > I'm also thinking that at some point it becomes more efficient to do
> > a indirect sort followed by take than to move all those big strings
> > around. But I guess we won't know where that point is until we have
> > both versions available.
>
> I've done some experiments in that matter too.  They are saying that,
> with the current mergesort in NumPy, an indirect sort followed by take
> performs similarly to direct sort for small string lengths (<=16), but
> indirect sort starts to win afterwards.
>
> The version with quicksort and optimized sSWAP should be between 2x and
> 3x times faster than current mergesort implementation, so the advantage
> for direct sort could grow up to somewhere between 50 and 100.  A nice
> idea could be doing some more toughful experiments in order to find the
> point where an indirect sort followed by a take would be more efficient
> and automatically select this method beyond that point.
>
> However, this has the drawback that you have to use additional memory
> for keeping the indices in the indirect method.  Of course, when
> strings are large, those indices should take a rather negligible space
> compared with strings itself.  In any case, in some situations where
> space is critical, this can still be important.  I don't know, but my
> opinion is that we shouldn't take too agressive optimizations for that
> matter.  My vote is to document this possibility in the docstrings, so
> that the user wanting for extreme performance can use this approach if
> he wants to.  Still, for string sizes greater than, say, 1000, well, an
> automatic selection of the indirect method is very tempting indeed.
>

The strings with zeros problem runs deeper than it looked at first glance.
Normal sorts don't work either, which means the type has a bad comparison
function. And argsort still doesn't work even with the correct comparison
function. Python, however, works as it should sorting lists of strings with
zeros. So I'm going to have to track down and fix this oddity, but it is
going to delay putting in the type specific quicksort for strings.

Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20080209/d03ed184/attachment.html>


More information about the NumPy-Discussion mailing list