On Tue, Jun 28, 2022 at 6:33 PM Miles Cranmer <miles.cranmer@gmail.com> wrote:
Dear all,

There is a PR that adds a lookup table approach to `unique`, shown below. You can get up to ~16x speedup for large integer arrays, at the cost of potentially greater memory usage.

I've seen multiple requests for not sorting in `unique`, and the associated performance benefits can indeed be very large. So +1 in principle to add this option.


https://github.com/numpy/numpy/pull/21843

This is controlled by a new `kind` parameter, which is described below. The current approach uses "sort" while the proposed change implements the "table" method. The default option None will automatically select "table" when possible and memory usage is not too large.

You cannot switch the default behavior, that will break backwards compatibility.
 
```
    kind : {None, 'sort', 'table'}, optional

Regarding the name, `'table'` is an implementation detail. The end user should not have to care what the data structure is that is used. I suggest to use something like "unsorted" and just explain it as the ordering of results being undefined, which can give significant performance benefits.

Cheers,
Ralf

        The algorithm to use. This will not affect the final result,
        but will affect the speed and memory use. The default, None,
        will select automatically based on memory considerations.

        * If 'sort', will use a mergesort-based approach.
        * If 'table', will use a lookup table approach similar
          to a counting sort. This is only available for boolean and
          integer arrays. This will have a memory usage of the
          size of `ar` plus the max-min value of `ar`. The options
          `return_index`, `return_inverse`, `axis`, and `equal_nan`
          are unavailable with this option.
        * If None, will automatically choose 'table' if possible,
          and the required memory allocation is less than or equal to
          6 times the size of `ar`. Will otherwise will use 'sort'.
          This is done to not use a large amount of memory by default,
          even though 'table' may be faster in most cases.
```
The method and API are very similar to that merged last week for `isin`: https://github.com/numpy/numpy/pull/12065/. One difference is that `return_counts` required a slightly modified approach–using `bincount` seems to work well for this.

I am eager to hear your comments on this new PR.

Thanks!
Miles
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-leave@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: ralf.gommers@googlemail.com