On Tue, 28 Jun 2022, 6:50 pm Ralf Gommers, <ralf.gommers@gmail.com> wrote:
 
```
    kind : {None, 'sort', 'table'}, optional

Regarding the name, `'table'` is an implementation detail. The end user should not have to care what the data structure is that is used. I suggest to use something like "unsorted" and just explain it as the ordering of results being undefined, which can give significant performance benefits.

But that suggests that, if I wanted them  sorted, I should use the "sorted" kind, but it is probably faster to do a table unique and sort the results. 

There are two concerns from the point of view of the user. One is the sorting of the results, and the other is memory usage. I suggest adding two boolean flags, "low_memory" (following sklearn, and I think, scipy too), and "sorted". Depending on the algorithm, "sorted=True" will perform a sort, or do nothing.

/David 


Cheers,
Ralf

        The algorithm to use. This will not affect the final result,
        but will affect the speed and memory use. The default, None,
        will select automatically based on memory considerations.

        * If 'sort', will use a mergesort-based approach.
        * If 'table', will use a lookup table approach similar
          to a counting sort. This is only available for boolean and
          integer arrays. This will have a memory usage of the
          size of `ar` plus the max-min value of `ar`. The options
          `return_index`, `return_inverse`, `axis`, and `equal_nan`
          are unavailable with this option.
        * If None, will automatically choose 'table' if possible,
          and the required memory allocation is less than or equal to
          6 times the size of `ar`. Will otherwise will use 'sort'.
          This is done to not use a large amount of memory by default,
          even though 'table' may be faster in most cases.
```
The method and API are very similar to that merged last week for `isin`: https://github.com/numpy/numpy/pull/12065/. One difference is that `return_counts` required a slightly modified approach–using `bincount` seems to work well for this.

I am eager to hear your comments on this new PR.

Thanks!
Miles
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-leave@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: ralf.gommers@googlemail.com
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-leave@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: davidmenhur@gmail.com