Hi all, I have been taking a deep look at the sorting functionality in numpy, and I think it could use a face lift in the form of a big code refactor, to get rid of some of the ugliness in the code and make it easier to maintain. What I have in mind basically amounts to: 1. Refactor _new_argsortlike to get rid of code duplication (there are two branches, one with buffering, one without, virtually identical, that could be merged into a single one). 2. Modify _new_argsortlike so that it can properly handle byte-swapped inputs of any dtype, see gh-5441. Add proper handling of types with references, in preparation for the rest of changes. 3. Add three functions to the npy_sort library: npy_aquicksort, npy_aheapsort, npy_amergesort, with a signature compatible with PyArray_ArgSortFunc , i.e. (char* start, npy_intp* result, npy_intp length, PyArrayObject *arr). These turn out to be almost identical to the string and unicode sort functions, but using the dtype's compare function to handle comparisons. 4. Modify PyArray_ArgSort (and PyArray_ArgPartition) to always call _new_argsortlike, even when there is no type specific argsort function, by using the newly added npy_axxx functions. This simplifies PyArray_ArgSort a lot, and gets rid of some of the global variable ugliness in the current code. And makes argsorting over non-contiguous axis more memory efficient. 5. Refactor _new_sortlike similarly to _new_argsortlike 6. Modify the npy_quicksort, npy_mergesort and npy_heapsort functions in npy_sort to have a signature compatible with PyArray_SortFunc, i.e. (char* start, npy_intp length, PyArrayObject *arr). npy_quicksort will no longer rely on libc's qsort, but be very similar to the string or unicode quicksort functions 7. Modify PyArray_Sort (and PyArray_Partition) to always call _new_sortlike, similarly to what was done with PyArray_ArgSort. This allows completing the removal of the remaining global variable ugliness, as well as similar benefits as for argsort before. This changes will make it easier for me to add a Timsort generic type function to numpy's arsenal of sorting routines. And I think they have value by cleaning the source code on their own. So my questions, mostly to the poor souls that will have to code review changes to several hundred lines of code: 1. Does this make sense, or is it better left alone? A subset of 1, 2 and 5 are a must to address the issues in gh-5441, the rest could arguably be left as is. 2. Would you rather see it submitted as one ginormous PR? Or split into 4 or 5 incremental ones? Jaime -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
Hi, On Fri, Jan 16, 2015 at 5:24 AM, Jaime Fernández del Río <jaime.frio@gmail.com> wrote:
Hi all,
I have been taking a deep look at the sorting functionality in numpy, and I think it could use a face lift in the form of a big code refactor, to get rid of some of the ugliness in the code and make it easier to maintain. What I have in mind basically amounts to:
Refactor _new_argsortlike to get rid of code duplication (there are two branches, one with buffering, one without, virtually identical, that could be merged into a single one). Modify _new_argsortlike so that it can properly handle byte-swapped inputs of any dtype, see gh-5441. Add proper handling of types with references, in preparation for the rest of changes. Add three functions to the npy_sort library: npy_aquicksort, npy_aheapsort, npy_amergesort, with a signature compatible with PyArray_ArgSortFunc , i.e. (char* start, npy_intp* result, npy_intp length, PyArrayObject *arr). These turn out to be almost identical to the string and unicode sort functions, but using the dtype's compare function to handle comparisons. Modify PyArray_ArgSort (and PyArray_ArgPartition) to always call _new_argsortlike, even when there is no type specific argsort function, by using the newly added npy_axxx functions. This simplifies PyArray_ArgSort a lot, and gets rid of some of the global variable ugliness in the current code. And makes argsorting over non-contiguous axis more memory efficient. Refactor _new_sortlike similarly to _new_argsortlike Modify the npy_quicksort, npy_mergesort and npy_heapsort functions in npy_sort to have a signature compatible with PyArray_SortFunc, i.e. (char* start, npy_intp length, PyArrayObject *arr). npy_quicksort will no longer rely on libc's qsort, but be very similar to the string or unicode quicksort functions Modify PyArray_Sort (and PyArray_Partition) to always call _new_sortlike, similarly to what was done with PyArray_ArgSort. This allows completing the removal of the remaining global variable ugliness, as well as similar benefits as for argsort before.
This changes will make it easier for me to add a Timsort generic type function to numpy's arsenal of sorting routines. And I think they have value by cleaning the source code on their own. So my questions, mostly to the poor souls that will have to code review changes to several hundred lines of code:
Does this make sense, or is it better left alone? A subset of 1, 2 and 5 are a must to address the issues in gh-5441, the rest could arguably be left as is. Would you rather see it submitted as one ginormous PR? Or split into 4 or 5 incremental ones?
Do you think it would be possible to split this into several PRs, with the initial one being the refactoring, and the subsequent ones being additions to sorting functionality? I'm guessing that the refactoring is something everyone wants (sounds great to me), whereas changes to the sorting needs more specific discussion. Cheers, Matthew
On Fri, Jan 16, 2015 at 4:19 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
Hi all,
I have been taking a deep look at the sorting functionality in numpy, and I think it could use a face lift in the form of a big code refactor, to get rid of some of the ugliness in the code and make it easier to maintain. What I have in mind basically amounts to:
Refactor _new_argsortlike to get rid of code duplication (there are two branches, one with buffering, one without, virtually identical, that could be merged into a single one). Modify _new_argsortlike so that it can properly handle byte-swapped inputs of any dtype, see gh-5441. Add proper handling of types with references, in preparation for the rest of changes. Add three functions to the npy_sort library: npy_aquicksort, npy_aheapsort, npy_amergesort, with a signature compatible with PyArray_ArgSortFunc , i.e. (char* start, npy_intp* result, npy_intp length, PyArrayObject *arr). These turn out to be almost identical to the string and unicode sort functions, but using the dtype's compare function to handle comparisons. Modify PyArray_ArgSort (and PyArray_ArgPartition) to always call _new_argsortlike, even when there is no type specific argsort function, by using the newly added npy_axxx functions. This simplifies PyArray_ArgSort a lot, and gets rid of some of the global variable ugliness in the current code. And makes argsorting over non-contiguous axis more memory efficient. Refactor _new_sortlike similarly to _new_argsortlike Modify the npy_quicksort, npy_mergesort and npy_heapsort functions in npy_sort to have a signature compatible with PyArray_SortFunc, i.e. (char* start, npy_intp length, PyArrayObject *arr). npy_quicksort will no longer rely on libc's qsort, but be very similar to the string or unicode quicksort functions Modify PyArray_Sort (and PyArray_Partition) to always call _new_sortlike, similarly to what was done with PyArray_ArgSort. This allows completing
removal of the remaining global variable ugliness, as well as similar benefits as for argsort before.
This changes will make it easier for me to add a Timsort generic type function to numpy's arsenal of sorting routines. And I think they have value by cleaning the source code on their own. So my questions, mostly to the poor souls that will have to code review changes to several hundred
On Fri, Jan 16, 2015 at 5:24 AM, Jaime Fernández del Río <jaime.frio@gmail.com> wrote: the lines
of code:
Does this make sense, or is it better left alone? A subset of 1, 2 and 5 are a must to address the issues in gh-5441, the rest could arguably be left as is. Would you rather see it submitted as one ginormous PR? Or split into 4 or 5 incremental ones?
Do you think it would be possible to split this into several PRs, with the initial one being the refactoring, and the subsequent ones being additions to sorting functionality?
Just to be clear, nothing in the long list of changes I posted earlier is truly a change in the sorting functionality, except in the case of quicksort (and argquicksort) for generic types, which would no longer rely on qsort, but on our own implementation of it, just like every other sort in numpy. So yes, the refactor PR can precede new functionality. And it can even be split into 3 or 4 incremental PRs, to make the reviewers life easier. Since it is most likely Charles and/or Julian that are going to have to swallow the review pill, I'd like to hear from them how would they like it better. Jaime -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
participants (2)
-
Jaime Fernández del Río
-
Matthew Brett