Re: [Numpy-discussion] Sorting refactor
2015-01-16 11:55 GMT+01:00
Message: 2
Date: Thu, 15 Jan 2015 21:24:00 -0800
From: Jaime Fernández del Río
Subject: [Numpy-discussion] Sorting refactor
To: Discussion of Numerical Python
Content-Type: text/plain; charset="utf-8"

These changes will make it easier for me to add a Timsort generic type function to numpy's arsenal of sorting routines. And I think they have value by cleaning the source code on their own.
Yes, they do. I've been looking at the sorting functions as well and I've found the following:

* The code is generally hard to read because it prefers pointers over indices. I'm wondering if it would get slower using indices. The closer these algorithms are to the textbook, the easier it is to insert fancy optimizations.

* The heap sort exploits undefined behavior by using a pointer that points before the start of the array. However, rewriting it to always point within the array made it slower. I haven't tried rewriting it using indices.

* Quicksort has a quadratic time worst case. I think it should be turned into an introsort [1] for O(n log n) worst case; we have the heapsort needed to do that.

* Quicksort is robust to repeated elements, but doesn't exploit them. It can be made to run in linear time if the input array has only O(1) distinct elements [2]. This may come at the expense of some performance on arrays with no repeated elements.

* Using optimal sorting networks instead of insertion sort as the base case can speed up quicksort on float arrays by 5-10%, but only if NaNs are moved out of the way first so that comparisons become cheaper [3]. This has consequences for the selection algorithms that I haven't fully worked out yet.

* Using Cilk Plus to parallelize merge sort can make it significantly faster than quicksort at the expense of only a few lines of code, but I haven't checked whether Cilk Plus plays nicely with multiprocessing and other parallelism options (remember the trouble with OpenMP-ified OpenBLAS?).

This isn't really an answer to your questions, more like a brain dump from someone who's stared at the same code for a while and did some experiments. I'm not saying we should implement all of this, but keep in mind that there are some interesting options besides implementing timsort.

[1] https://en.wikipedia.org/wiki/Introsort
[2] http://www.sortingalgorithms.com/static/QuicksortIsOptimal.pdf
[3] https://github.com/larsmans/numpy/tree/sortingnets
On 16.01.2015 12:33, Lars Buitinck wrote:
Yes, they do. I've been looking at the sorting functions as well and I've found the following:
* The code is generally hard to read because it prefers pointers over indices. I'm wondering if it would get slower using indices. The closer these algorithms are to the textbook, the easier it is to insert fancy optimizations.
* The heap sort exploits undefined behavior by using a pointer that points before the start of the array. However, rewriting it to always point within the array made it slower. I haven't tried rewriting it using indices.
* Quicksort has a quadratic time worst case. I think it should be turned into an introsort [1] for O(n log n) worst case; we have the heapsort needed to do that.
This probably rarely happens with numeric data, and we do have guaranteed O(n log n) runtime algorithms available. But it also is not costly to do; e.g., the selection code is an introselect instead of a normal quickselect. I'd say not high priority, but if someone wants to do it I don't see why not.
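For concreteness, the introsort idea being proposed can be sketched in a few lines of Python (a hypothetical illustration, not the numpy C code): run quicksort as usual, but cap the recursion depth at roughly 2*log2(n), and fall back to heapsort for any partition that exceeds the cap, which bounds the worst case at O(n log n).

```python
import heapq
from math import log2


def introsort(a, maxdepth=None):
    """Quicksort with a recursion-depth cap; past the cap, fall back
    to heapsort so the worst case stays O(n log n)."""
    if maxdepth is None:
        maxdepth = 2 * int(log2(len(a))) if a else 0
    if len(a) <= 1:
        return a
    if maxdepth == 0:
        # Heapsort fallback: heapify, then pop elements in order.
        heapq.heapify(a)
        return [heapq.heappop(a) for _ in range(len(a))]
    pivot = a[len(a) // 2]
    lo = [x for x in a if x < pivot]
    eq = [x for x in a if x == pivot]
    hi = [x for x in a if x > pivot]
    return introsort(lo, maxdepth - 1) + eq + introsort(hi, maxdepth - 1)
```

The real implementation would of course partition in place; the point here is only the depth-limit-plus-fallback structure.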
* Quicksort is robust to repeated elements, but doesn't exploit them. It can be made to run in linear time if the input array has only O(1) distinct elements [2]. This may come at the expense of some performance on arrays with no repeated elements.
* Using optimal sorting networks instead of insertion sort as the base case can speed up quicksort on float arrays by 5-10%, but only if NaNs are moved out of the way first so that comparisons become cheaper [3]. This has consequences for the selection algorithms that I haven't fully worked out yet.
I was also thinking about this. An advantage of a sorting network is that it can be vectorized to be significantly faster than an insertion sort. Handling NaNs should also be directly possible. The issue is that it's probably too much complicated code for only a very small gain.
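To make the idea concrete, here is the classic optimal 5-comparator network for n = 4 as a Python sketch (illustrative only): the compare-exchange sequence is fixed and data-independent, which is exactly why a C version can turn each step into branchless min/max operations and vectorize it.

```python
# Optimal sorting network for n=4: five fixed compare-exchange steps.
# The sequence never depends on the data, so it is SIMD-friendly.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def netsort4(a):
    """Sort a 4-element list in place using the fixed comparator network."""
    for i, j in NETWORK_4:
        if a[j] < a[i]:
            a[i], a[j] = a[j], a[i]
    return a
```

By the zero-one principle, checking the network on all permutations (or all 0/1 inputs) proves it sorts every input of length 4.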
* Using Cilk Plus to parallelize merge sort can make it significantly faster than quicksort at the expense of only a few lines of code, but I haven't checked whether Cilk Plus plays nicely with multiprocessing and other parallelism options (remember the trouble with OpenMP-ified OpenBLAS?).
You should also be able to do this with OpenMP tasks, though it will be a little less efficient, as Cilk+ has a better scheduler for this type of work. But I assume you will get the same trouble as with OpenMP, though that needs testing. Also, Cilk+ in gcc is not really production-ready yet; I got lots of crashes when I last tried it (it will be in 5.0 though).
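For illustration only, the task structure both approaches share looks like this in Python (threads here won't actually beat serial sorting because of the GIL; the point is the spawn/recurse/join shape that `cilk_spawn` or `#pragma omp task` would express in C, plus the serial cutoff that keeps task overhead in check):

```python
from concurrent.futures import ThreadPoolExecutor

CUTOFF = 1 << 13  # below this size, sorting serially beats task overhead


def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            out.append(right[j]); j += 1
        else:
            out.append(left[i]); i += 1
    out.extend(left[i:]); out.extend(right[j:])
    return out


def parallel_mergesort(a, pool):
    if len(a) <= CUTOFF:
        return sorted(a)  # serial base case
    mid = len(a) // 2
    # Spawn the left half as a task (cf. cilk_spawn / "#pragma omp task")
    # and recurse on the right half in the current thread; join at merge.
    left = pool.submit(parallel_mergesort, a[:mid], pool)
    right = parallel_mergesort(a[mid:], pool)
    return merge(left.result(), right)
```

Note the pool needs enough workers for the (shallow, thanks to the cutoff) task tree, or recursive submits can starve waiting on each other.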
[1] https://en.wikipedia.org/wiki/Introsort
[2] http://www.sortingalgorithms.com/static/QuicksortIsOptimal.pdf
[3] https://github.com/larsmans/numpy/tree/sortingnets

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
I don't know if there is a general consensus or guideline on these matters, but I am personally not entirely charmed by the use of behind-the-scenes parallelism, unless explicitly requested. Perhaps an algorithm can be made faster, but often these multi-core algorithms are also less efficient, and a less data-dependent way of putting my cores to good use would have been preferable. Potentially, other code could slow down due to cache thrashing if too many parallel tasks run in parallel. I'd rather be in charge of such matters myself; but I imagine adding a keyword arg for these matters would not be much trouble?

On Fri, Jan 16, 2015 at 12:43 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
On 16 January 2015 at 13:15, Eelco Hoogendoorn
Perhaps an algorithm can be made faster, but often these multi-core algorithms are also less efficient, and a less data-dependent way of putting my cores to good use would have been preferable. Potentially, other code could slow down due to cache thrashing if too many parallel tasks run in parallel. I'd rather be in charge of such matters myself; but I imagine adding a keyword arg for these matters would not be much trouble?
As I understand it, that is where the strength of Cilk+ lies. It does not force parallelisation, it just suggests it. The decision to actually spawn parallel tasks is made at runtime depending on the load of the other cores. /David.
On Fri, Jan 16, 2015 at 3:33 AM, Lars Buitinck
Yes, they do. I've been looking at the sorting functions as well and I've found the following:
* The code is generally hard to read because it prefers pointers over indices. I'm wondering if it would get slower using indices. The closer these algorithms are to the textbook, the easier it is to insert fancy optimizations.
They are harder to read, but so cute to look at! C code just wouldn't feel the same without some magical pointer arithmetic thrown in here and there. ;)
* The heap sort exploits undefined behavior by using a pointer that points before the start of the array. However, rewriting it to always point within the array made it slower. I haven't tried rewriting it using indices.
* Quicksort has a quadratic time worst case. I think it should be turned into an introsort [1] for O(n log n) worst case; we have the heapsort needed to do that.
* Quicksort is robust to repeated elements, but doesn't exploit them. It can be made to run in linear time if the input array has only O(1) distinct elements [2]. This may come at the expense of some performance on arrays with no repeated elements.
Java famously changed its library implementation of quicksort to a dual-pivot one invented by Vladimir Yaroslavskiy [1], reportedly with substantial performance gains. I tried to implement that for numpy [2], but couldn't get it to work any faster than the current code.

* Using optimal sorting networks instead of insertion sort as the base case can speed up quicksort on float arrays by 5-10%, but only if NaNs are moved out of the way first so that comparisons become cheaper [3]. This has consequences for the selection algorithms that I haven't fully worked out yet.
Even if we stick with selection sort, we should spin it off into an inline smallsort function within the npy_sort library, and have quicksort and mergesort call the same function, instead of each implementing their own. It would make optimizations like the sorting networks easier to implement for all sorts. We could even expose it outside npy_sort, as there are a few places around the code base that have ad-hoc implementations of sorting.
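A sketch of that refactor in Python (hypothetical names; npy_sort is C, and `npy_smallsort` is invented here purely for illustration): one small-sort helper serves as the shared base case, and quicksort bottoms out into it instead of carrying its own copy, exactly as mergesort would.

```python
def npy_smallsort(a, lo, hi):
    """Insertion-sort a[lo:hi+1] in place -- the shared base case that
    every sort would call instead of duplicating it."""
    for i in range(lo + 1, hi + 1):
        v = a[i]
        j = i - 1
        while j >= lo and a[j] > v:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = v


def partition(a, lo, hi):
    """Lomuto partition around the last element; returns the pivot index."""
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i


SMALL = 16  # switch-over threshold for the small-sort base case


def quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    while hi - lo > SMALL:
        p = partition(a, lo, hi)
        quicksort(a, lo, p - 1)
        lo = p + 1  # tail-call the right partition
    npy_smallsort(a, lo, hi)  # the same helper mergesort would call
```

With the helper factored out, swapping insertion sort for a sorting network changes one function for every caller at once.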
* Using Cilk Plus to parallelize merge sort can make it significantly faster than quicksort at the expense of only a few lines of code, but I haven't checked whether Cilk Plus plays nicely with multiprocessing and other parallelism options (remember the trouble with OpenMP-ified OpenBLAS?).
This isn't really an answer to your questions, more like a brain dump from someone who's stared at the same code for a while and did some experiments. I'm not saying we should implement all of this, but keep in mind that there are some interesting options besides implementing timsort.
Timsort came up in a discussion several months ago, where I proposed adding a mergesorted function (which I have mostly ready, by the way, [3]) to speed up some operations in arraysetops. I have serious doubts that it will perform comparably to the other sorts unless comparisons are terribly expensive, which they typically aren't in numpy, but it has been an interesting learning exercise so far, and I don't mind taking it all the way.

Most of my proposed original changes do not affect the core sorting functionality, just the infrastructure around it. But if we agree that sorting has potential for being an actively developed part of the code base, then cleaning up its surroundings for clarity makes sense, so I'm taking your brain dump as an aye for my proposal. ;)

Jaime

[1] http://iaroslavski.narod.ru/quicksort/DualPivotQuicksort.pdf
[2] https://github.com/jaimefrio/numpy/commit/a99dd77cda7c4c0f2df1fb17a59c20e199...
[3] https://github.com/jaimefrio/numpy/commit/2f53c99e7ec6d14fd77a29f9d3c1712d5b...
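Since the details of the proposed function live in the linked commit, here is only a hypothetical sketch of what such a primitive could look like: merge two already-sorted inputs in a single pass while recording which side each output element came from, the kind of information arraysetops routines (union1d, intersect1d, and friends) can exploit without re-sorting the concatenation.

```python
def mergesorted(a, b):
    """Merge two already-sorted sequences. Returns the merged values and,
    for each value, a flag saying which input it came from (0=a, 1=b).
    Hypothetical sketch of the primitive, not the actual PR."""
    out, src = [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if b[j] < a[i]:
            out.append(b[j]); src.append(1); j += 1
        else:
            out.append(a[i]); src.append(0); i += 1
    out.extend(a[i:]); src.extend([0] * (len(a) - i))
    out.extend(b[j:]); src.extend([1] * (len(b) - j))
    return out, src
```

For example, runs where equal values carry both flags mark intersection members, and collapsing duplicates in `out` yields the union, all in O(len(a) + len(b)).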
(\__/)
( O.o)
( > <)
This is Rabbit. Copy Rabbit into your signature and help him with his plans for world domination.
On Fri, Jan 16, 2015 at 7:11 AM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Fri, Jan 16, 2015 at 3:33 AM, Lars Buitinck wrote:
* The code is generally hard to read because it prefers pointers over indices. I'm wondering if it would get slower using indices. The closer these algorithms are to the textbook, the easier it is to insert fancy optimizations.
They are harder to read, but so cute to look at! C code just wouldn't feel the same without some magical pointer arithmetic thrown in here and there. ;)
Pointers were faster than indexing. That advantage can be hardware dependent, but it is typical for small numbers of pointers.
* The heap sort exploits undefined behavior by using a pointer that points before the start of the array. However, rewriting it to always point within the array made it slower. I haven't tried rewriting it using indices.

Fortran uses the same pointer trick for one-based indexing, or at least the old DEC compilers did ;) There is no reason to avoid it.
* Quicksort has a quadratic time worst case. I think it should be turned into an introsort [1] for O(n log n) worst case; we have the heapsort needed to do that.
* Quicksort is robust to repeated elements, but doesn't exploit them. It can be made to run in linear time if the input array has only O(1) distinct elements [2]. This may come at the expense of some performance on arrays with no repeated elements.
Java famously changed its library implementation of quicksort to a dual-pivot one invented by Vladimir Yaroslavskiy [1], reportedly with substantial performance gains. I tried to implement that for numpy [2], but couldn't get it to work any faster than the current code.
For sorting, simple often beats fancy.
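The dual-pivot scheme under discussion can be sketched very loosely in Python (Yaroslavskiy's actual in-place pointer dance is considerably more involved): two pivots p <= q split the array into three regions, below p, between the pivots, and above q, each recursed on separately.

```python
def dual_pivot_sort(a):
    """Simplified dual-pivot quicksort sketch. Two pivots p <= q split
    the remaining elements into three parts: < p, p..q, and > q."""
    if len(a) <= 1:
        return a
    p, q = min(a[0], a[-1]), max(a[0], a[-1])
    inner = a[1:-1]  # everything except the two pivot elements
    left = [x for x in inner if x < p]
    mid = [x for x in inner if p <= x <= q]
    right = [x for x in inner if x > q]
    # Each recursive call sees at least two fewer elements (the pivots),
    # so the recursion terminates.
    return (dual_pivot_sort(left) + [p] + dual_pivot_sort(mid)
            + [q] + dual_pivot_sort(right))
```

The claimed advantage of the real algorithm is fewer swaps and better cache behavior from the three-way traversal, not asymptotics; this sketch only shows the partitioning invariant.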
* Using optimal sorting networks instead of insertion sort as the base case can speed up quicksort on float arrays by 5-10%, but only if NaNs are moved out of the way first so that comparisons become cheaper [3]. This has consequences for the selection algorithms that I haven't fully worked out yet.
I expect the gains here would be for small sorts, which tend to be dominated by call overhead.
Even if we stick with selection sort, we should spin it off into an inline smallsort function within the npy_sort library, and have quicksort and mergesort call the same function, instead of each implementing their own. It would make optimizations like the sorting networks easier to implement for all sorts. We could even expose it outside npy_sort, as there are a few places around the code base that have ad-hoc implementations of sorting.
Good idea, I've thought of doing it myself.
* Using Cilk Plus to parallelize merge sort can make it significantly faster than quicksort at the expense of only a few lines of code, but I haven't checked whether Cilk Plus plays nicely with multiprocessing and other parallelism options (remember the trouble with OpenMP-ified OpenBLAS?).
This isn't really an answer to your questions, more like a brain dump from someone who's stared at the same code for a while and did some experiments. I'm not saying we should implement all of this, but keep in mind that there are some interesting options besides implementing timsort.
Timsort came up in a discussion several months ago, where I proposed adding a mergesorted function (which I have mostly ready, by the way, [3]) to speed up some operations in arraysetops. I have serious doubts that it will perform comparably to the other sorts unless comparisons are terribly expensive, which they typically aren't in numpy, but it has been an interesting learning exercise so far, and I don't mind taking it all the way.
Most of my proposed original changes do not affect the core sorting functionality, just the infrastructure around it. But if we agree that sorting has potential for being an actively developed part of the code base, then cleaning up its surroundings for clarity makes sense, so I'm taking your brain dump as an aye for my proposal. ;)
I have a generic quicksort with standard interface sitting around somewhere in an ancient branch. Sorting objects needs to be sensitive to comparison exceptions, which is something to keep in mind. I'd also like to push the GIL release back down into the interface functions where it used to be, but that isn't a priority. Another possibility I've toyed with is adding a step for sorting non-contiguous arrays, but the sort functions being part of the dtype complicates that for compatibility reasons. I suppose that could be handled with interface functions. I think the prototypes should also be regularized. Cleaning up the sorting dispatch to use just one function and avoid the global would be good; the current code is excessively ugly. That cleanup, together with a generic quicksort, would be a good place to start. And remember, simpler is better. Usually.

Chuck
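On the comparison-exception point, the required contract can be illustrated in Python (a hypothetical sketch of the error-handling behavior, not numpy's actual C machinery, where a compare function would report failure through a flag instead): if any comparison raises, the sort must bail out and report the error rather than leave the array half-permuted.

```python
def sort_objects(a):
    """Sort a list of arbitrary objects in place, leaving the input
    untouched and flagging failure if any comparison raises.
    Hypothetical sketch of the contract, not numpy's C code."""
    tmp = list(a)
    try:
        tmp.sort()          # any __lt__ exception propagates out of here
    except Exception:
        return a, False     # input unchanged, error reported to caller
    a[:] = tmp              # commit only on full success
    return a, True
```

Working on a scratch copy and committing at the end is one simple way to get the all-or-nothing behavior; the C code could equally check an error flag after each compare and unwind.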
On Fri, Jan 16, 2015 at 8:15 AM, Charles R Harris wrote:
I have a generic quicksort with standard interface sitting around somewhere in an ancient branch. Sorting objects needs to be sensitive to comparison exceptions, which is something to keep in mind. I'd also like to push the GIL release back down into the interface functions where it used to be, but that isn't a priority. Another possibility I've toyed with is adding a step for sorting non-contiguous arrays, but the sort functions being part of the dtype complicates that for compatibility reasons. I suppose that could be handled with interface functions. I think the prototypes should also be regularized.

Cleaning up the sorting dispatch to use just one function and avoid the global would be good; the current code is excessively ugly. That cleanup, together with a generic quicksort, would be a good place to start.
Let the fun begin then... I have just sent PR #5458, in case anyone wants to take a look.

Jaime

(\__/)
( O.o)
( > <)
This is Rabbit. Copy Rabbit into your signature and help him with his plans for world domination.
participants (6)
* Charles R Harris
* Daπid
* Eelco Hoogendoorn
* Jaime Fernández del Río
* Julian Taylor
* Lars Buitinck