Speeding up `unique` and adding "kind" parameter

Dear all, There is a PR that adds a lookup table approach to `unique`, shown below. You can get up to ~16x speedup for large integer arrays, at the cost of potentially greater memory usage. https://github.com/numpy/numpy/pull/21843 This is controlled by a new `kind` parameter, which is described below. The current approach uses "sort" while the proposed change implements the "table" method. The default option None will automatically select "table" when possible and memory usage is not too large. ``` kind : {None, 'sort', 'table'}, optional The algorithm to use. This will not affect the final result, but will affect the speed and memory use. The default, None, will select automatically based on memory considerations. * If 'sort', will use a mergesort-based approach. * If 'table', will use a lookup table approach similar to a counting sort. This is only available for boolean and integer arrays. This will have a memory usage of the size of `ar` plus the max-min value of `ar`. The options `return_index`, `return_inverse`, `axis`, and `equal_nan` are unavailable with this option. * If None, will automatically choose 'table' if possible, and the required memory allocation is less than or equal to 6 times the size of `ar`. Will otherwise will use 'sort'. This is done to not use a large amount of memory by default, even though 'table' may be faster in most cases. ``` The method and API are very similar to that merged last week for `isin`: https://github.com/numpy/numpy/pull/12065/. One difference is that `return_counts` required a slightly modified approach–using `bincount` seems to work well for this. I am eager to hear your comments on this new PR. Thanks! Miles

On Tue, Jun 28, 2022 at 6:33 PM Miles Cranmer <miles.cranmer@gmail.com> wrote:
Dear all,
There is a PR that adds a lookup table approach to `unique`, shown below. You can get up to ~16x speedup for large integer arrays, at the cost of potentially greater memory usage.
I've seen multiple requests for not sorting in `unique`, and the associated performance benefits can indeed be very large. So +1 in principle to add this option.
https://github.com/numpy/numpy/pull/21843
This is controlled by a new `kind` parameter, which is described below. The current approach uses "sort" while the proposed change implements the "table" method. The default option None will automatically select "table" when possible and memory usage is not too large.
You cannot switch the default behavior, that will break backwards compatibility.
``` kind : {None, 'sort', 'table'}, optional
Regarding the name, `'table'` is an implementation detail. The end user should not have to care what the data structure is that is used. I suggest to use something like "unsorted" and just explain it as the ordering of results being undefined, which can give significant performance benefits. Cheers, Ralf The algorithm to use. This will not affect the final result,
but will affect the speed and memory use. The default, None, will select automatically based on memory considerations.
* If 'sort', will use a mergesort-based approach. * If 'table', will use a lookup table approach similar to a counting sort. This is only available for boolean and integer arrays. This will have a memory usage of the size of `ar` plus the max-min value of `ar`. The options `return_index`, `return_inverse`, `axis`, and `equal_nan` are unavailable with this option. * If None, will automatically choose 'table' if possible, and the required memory allocation is less than or equal to 6 times the size of `ar`. Will otherwise will use 'sort'. This is done to not use a large amount of memory by default, even though 'table' may be faster in most cases. ``` The method and API are very similar to that merged last week for `isin`: https://github.com/numpy/numpy/pull/12065/. One difference is that `return_counts` required a slightly modified approach–using `bincount` seems to work well for this.
I am eager to hear your comments on this new PR.
Thanks! Miles _______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-leave@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: ralf.gommers@googlemail.com

On Tue, 28 Jun 2022, 6:50 pm Ralf Gommers, <ralf.gommers@gmail.com> wrote:
``` kind : {None, 'sort', 'table'}, optional
Regarding the name, `'table'` is an implementation detail. The end user should not have to care what the data structure is that is used. I suggest to use something like "unsorted" and just explain it as the ordering of results being undefined, which can give significant performance benefits.
But that suggests that, if I wanted them sorted, I should use the "sorted" kind, but it is probably faster to do a table unique and sort the results. There are two concerns from the point of view of the user. One is the sorting of the results, and the other is memory usage. I suggest adding two boolean flags, "low_memory" (following sklearn, and I think, scipy too), and "sorted". Depending on the algorithm, "sorted=True" will perform a sort, or do nothing. /David
Cheers, Ralf
The algorithm to use. This will not affect the final result,
but will affect the speed and memory use. The default, None, will select automatically based on memory considerations.
* If 'sort', will use a mergesort-based approach. * If 'table', will use a lookup table approach similar to a counting sort. This is only available for boolean and integer arrays. This will have a memory usage of the size of `ar` plus the max-min value of `ar`. The options `return_index`, `return_inverse`, `axis`, and `equal_nan` are unavailable with this option. * If None, will automatically choose 'table' if possible, and the required memory allocation is less than or equal to 6 times the size of `ar`. Will otherwise will use 'sort'. This is done to not use a large amount of memory by default, even though 'table' may be faster in most cases. ``` The method and API are very similar to that merged last week for `isin`: https://github.com/numpy/numpy/pull/12065/. One difference is that `return_counts` required a slightly modified approach–using `bincount` seems to work well for this.
I am eager to hear your comments on this new PR.
Thanks! Miles _______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-leave@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: ralf.gommers@googlemail.com
_______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-leave@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: davidmenhur@gmail.com

Thanks for the comments Ralf!
You cannot switch the default behavior, that will break backwards compatibility.
The default `kind=None` have no effect on input/output behavior of the function. The only changes a user will see are in terms of speed and memory usage. `unique` will select this new algorithm `"table"` only if it is available (integral array, no axis specified, return_index and return_inverse set to False) and the required memory allocation is not too big (which is arbitrarily defined as six times the allocation of the input array - similar to what the sorting method use). Using `kind="table"` won't affect the input/output either, but it is only available for certain arrays (somewhat similar to the usage of `assume_unique` for `np.isin`). Does this sound fine to you?
Regarding the name, `'table'` is an implementation detail. The end user should not have to care what the data structure is that is used. I suggest to use something like "unsorted" and just explain it as the ordering of results being undefined, which can give significant performance benefits.
The names `'table'` and `'sort'` were selected for consistency as these names were recently put in place for `np.isin` and `np.in1d`, and the analogous methods used for `unique` are conceptually similar. I have no particular attachment to either name though. Thanks again! Miles

Ah, I did not clarify this: `kind="table"` will *also* return a sorted array. It simply does not use a sorting algorithm to get to it. This is because the table is generated using `np.arange` (i.e., already sorted) which is then masked.

On Tue, Jun 28, 2022 at 7:21 PM Miles Cranmer <miles.cranmer@gmail.com> wrote:
Thanks for the comments Ralf!
You cannot switch the default behavior, that will break backwards compatibility.
The default `kind=None` have no effect on input/output behavior of the function. The only changes a user will see are in terms of speed and memory usage. `unique` will select this new algorithm `"table"` only if it is available (integral array, no axis specified, return_index and return_inverse set to False) and the required memory allocation is not too big (which is arbitrarily defined as six times the allocation of the input array - similar to what the sorting method use). Using `kind="table"` won't affect the input/output either, but it is only available for certain arrays (somewhat similar to the usage of `assume_unique` for `np.isin`). Does this sound fine to you?
Ah, I didn't get from the initial description that the results would still be sorted. Then I'd say that I'm not sure that: 1. I'm not sure that the performance vs. complexity trade-off is worth it. But I'll leave that to others; Sebastian seemed happy to merge the similar change in `in1d` 2. If you're interested in working more on this, implementing an unsorted option would be more interesting imho - significantly more performance benefits, and not just for integers or at the cost of more memory use. Cheers, Ralf
Regarding the name, `'table'` is an implementation detail. The end user should not have to care what the data structure is that is used. I suggest to use something like "unsorted" and just explain it as the ordering of results being undefined, which can give significant performance benefits.
The names `'table'` and `'sort'` were selected for consistency as these names were recently put in place for `np.isin` and `np.in1d`, and the analogous methods used for `unique` are conceptually similar. I have no particular attachment to either name though.
Thanks again! Miles _______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-leave@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: ralf.gommers@googlemail.com

Regarding 2., did you have a particular approach in mind? This new lookup table method is already O(n) scaling (similar to a counting sort), so I cannot fathom a method that, as you suggest, would get significantly better performance for integer arrays. The sorting here is "free" in some sense since you aren't spending additional cycles, the table is initialized pre-sorted. I could see in the future that one could create a similar approach for arbitrary datatypes, and you would use a hash table rather than a fixed-size array. In this case you would similarly get O(n) performance, and would produce an **unsorted** output. But in any case a hash table would probably be much less efficient for integers than an array as implemented here - the latter which also gives you a sorted output as a side effect. Personally I have used np.unique for integer data far more than any other type, and I would _guess_ it is similar in the broader community (though I have no data or insight to support that)–so I think having a big speedup like this could be quite nice for users. Thanks, Miles

On Wed, Jun 29, 2022 at 7:10 AM Miles Cranmer <miles.cranmer@gmail.com> wrote:
Regarding 2., did you have a particular approach in mind? This new lookup table method is already O(n) scaling (similar to a counting sort), so I cannot fathom a method that, as you suggest, would get significantly better performance for integer arrays. The sorting here is "free" in some sense since you aren't spending additional cycles, the table is initialized pre-sorted.
I'm not an expert, just remember having seen talks and implementations using hashing that claim to be a lot faster. https://douglasorr.github.io/2019-09-hash-vs-sort/article.html is quite interesting, and shows it's a complex topic. Did you look for literature on this? I'd imagine that that exists. Your PRs don't have a References section in the docstring; if this is a published method then that would be great to add. Cheers, Ralf
I could see in the future that one could create a similar approach for arbitrary datatypes, and you would use a hash table rather than a fixed-size array. In this case you would similarly get O(n) performance, and would produce an **unsorted** output. But in any case a hash table would probably be much less efficient for integers than an array as implemented here - the latter which also gives you a sorted output as a side effect. Personally I have used np.unique for integer data far more than any other type, and I would _guess_ it is similar in the broader community (though I have no data or insight to support that)–so I think having a big speedup like this could be quite nice for users.
Thanks, Miles _______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-leave@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: ralf.gommers@googlemail.com

So, this new method is in fact a hash table as discussed in that blog post. However, because it assumes integer arrays, we can go even further than that blog, and simply use `np.arange(ar_min, ar_max + 1)` is the "hash table". Thus, you don't actually need to use a hashing function at all, you can just use the integer itself as the hash of an integer, and use fast array operations. This is why it happens to return a sorted output, even though it's not technically necessary. The scaling is the same, but the overall constant in front of O(n) is much lower. In other words, returning a sorted array for integers gets you the best performance since you can use ``hash(integer)=integer`` (+ assume that your range of integers is relatively small). But for other datatypes like strings, as discussed on that blog post, ``hash(string)`` might have a large range over your strings, and it would be inefficient to allocate the entire thing. The closest analogy to this comparison would be counting sorts vs radix sorts. A radix sort (https://en.wikipedia.org/wiki/Radix_sort) uses a hash table, whereas a counting sort (https://en.wikipedia.org/wiki/Counting_sort) pre-allocates an array of length equal to the range of integer data. A radix sort has scaling O(n*k), for k the key size, and a counting sort has scaling O(n+k), for k the range of your data. However, a counting sort might have a better constant in front of the scaling, for fixed k, since it avoids having to create the hash table, and gets to use super efficient array operations. This PR is analogous to a counting sort, but a radix sort-like method is described on that blog (and would be useful for generic types). Does this help explain things? Cheers, Miles

On Wed, Jun 29, 2022 at 2:59 PM Miles Cranmer <miles.cranmer@gmail.com> wrote:
So, this new method is in fact a hash table as discussed in that blog post. However, because it assumes integer arrays, we can go even further than that blog, and simply use `np.arange(ar_min, ar_max + 1)` is the "hash table". Thus, you don't actually need to use a hashing function at all, you can just use the integer itself as the hash of an integer, and use fast array operations. This is why it happens to return a sorted output, even though it's not technically necessary. The scaling is the same, but the overall constant in front of O(n) is much lower.
In other words, returning a sorted array for integers gets you the best performance since you can use ``hash(integer)=integer`` (+ assume that your range of integers is relatively small). But for other datatypes like strings, as discussed on that blog post, ``hash(string)`` might have a large range over your strings, and it would be inefficient to allocate the entire thing.
The closest analogy to this comparison would be counting sorts vs radix sorts. A radix sort (https://en.wikipedia.org/wiki/Radix_sort) uses a hash table, whereas a counting sort ( https://en.wikipedia.org/wiki/Counting_sort) pre-allocates an array of length equal to the range of integer data. A radix sort has scaling O(n*k), for k the key size, and a counting sort has scaling O(n+k), for k the range of your data. However, a counting sort might have a better constant in front of the scaling, for fixed k, since it avoids having to create the hash table, and gets to use super efficient array operations. This PR is analogous to a counting sort, but a radix sort-like method is described on that blog (and would be useful for generic types).
Does this help explain things?
That is indeed helpful, thanks Miles! Cheers, Ralf

On Wed, 2022-06-29 at 12:56 +0000, Miles Cranmer wrote:
So, this new method is in fact a hash table as discussed in that blog post. However, because it assumes integer arrays, we can go even further than that blog, and simply use `np.arange(ar_min, ar_max + 1)` is the "hash table". Thus, you don't actually need to use a hashing function at all, you can just use the integer itself as the hash of an integer, and use fast array operations. This is why it happens to return a sorted output, even though it's not technically necessary. The scaling is the same, but the overall constant in front of O(n) is much lower.
Yeah, which is why I was happy with adding `kind=`. There was some support (and no-one against it) and it seemed unlikely that there is a "generic" solution that is can just be a great default in all cases. (If there is, deprecating `kind` is probably not terrible either.) Whether the `kind=` should cover methods that do not sort the output here, I am not sure. Maybe that should be its own kwarg. (Kinds that do not return a sorted result can still sort the result and hope it is much smaller). In general, I am happy with the addition though. The only larger argument against it that I can think of, would be if we promoted another library that does the job much better. Cheers, Sebastian
In other words, returning a sorted array for integers gets you the best performance since you can use ``hash(integer)=integer`` (+ assume that your range of integers is relatively small). But for other datatypes like strings, as discussed on that blog post, ``hash(string)`` might have a large range over your strings, and it would be inefficient to allocate the entire thing.
The closest analogy to this comparison would be counting sorts vs radix sorts. A radix sort (https://en.wikipedia.org/wiki/Radix_sort) uses a hash table, whereas a counting sort (https://en.wikipedia.org/wiki/Counting_sort) pre-allocates an array of length equal to the range of integer data. A radix sort has scaling O(n*k), for k the key size, and a counting sort has scaling O(n+k), for k the range of your data. However, a counting sort might have a better constant in front of the scaling, for fixed k, since it avoids having to create the hash table, and gets to use super efficient array operations. This PR is analogous to a counting sort, but a radix sort-like method is described on that blog (and would be useful for generic types).
Does this help explain things? Cheers, Miles _______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-leave@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: sebastian@sipsolutions.net
participants (4)
-
David Menéndez Hurtado
-
Miles Cranmer
-
Ralf Gommers
-
Sebastian Berg