On Feb 9, 2008 3:19 PM, Francesc Altet <faltet@carabos.com> wrote:
A Saturday 09 February 2008, Charles R Harris escrigué:
I'm also thinking that at some point it becomes more efficient to do a indirect sort followed by take than to move all those big strings around. But I guess we won't know where that point is until we have both versions available.
I've done some experiments in that matter too. They are saying that, with the current mergesort in NumPy, an indirect sort followed by take performs similarly to direct sort for small string lengths (<=16), but indirect sort starts to win afterwards.
The version with quicksort and optimized sSWAP should be between 2x and 3x times faster than current mergesort implementation, so the advantage for direct sort could grow up to somewhere between 50 and 100. A nice idea could be doing some more toughful experiments in order to find the point where an indirect sort followed by a take would be more efficient and automatically select this method beyond that point.
However, this has the drawback that you have to use additional memory for keeping the indices in the indirect method. Of course, when strings are large, those indices should take a rather negligible space compared with strings itself. In any case, in some situations where space is critical, this can still be important. I don't know, but my opinion is that we shouldn't take too agressive optimizations for that matter. My vote is to document this possibility in the docstrings, so that the user wanting for extreme performance can use this approach if he wants to. Still, for string sizes greater than, say, 1000, well, an automatic selection of the indirect method is very tempting indeed.
The strings with zeros problem runs deeper than it looked at first glance. Normal sorts don't work either, which means the type has a bad comparison function. And argsort still doesn't work even with the correct comparison function. Python, however, works as it should sorting lists of strings with zeros. So I'm going to have to track down and fix this oddity, but it is going to delay putting in the type specific quicksort for strings. Chuck