
On Wed, Jun 29, 2022 at 7:10 AM Miles Cranmer <miles.cranmer@gmail.com> wrote:
Regarding 2., did you have a particular approach in mind? This new lookup table method is already O(n) scaling (similar to a counting sort), so I cannot fathom a method that, as you suggest, would get significantly better performance for integer arrays. The sorting here is "free" in some sense since you aren't spending additional cycles, the table is initialized pre-sorted.
I'm not an expert, just remember having seen talks and implementations using hashing that claim to be a lot faster. https://douglasorr.github.io/2019-09-hash-vs-sort/article.html is quite interesting, and shows it's a complex topic. Did you look for literature on this? I'd imagine that that exists. Your PRs don't have a References section in the docstring; if this is a published method then that would be great to add. Cheers, Ralf
I could see in the future that one could create a similar approach for arbitrary datatypes, and you would use a hash table rather than a fixed-size array. In this case you would similarly get O(n) performance, and would produce an **unsorted** output. But in any case a hash table would probably be much less efficient for integers than an array as implemented here - the latter which also gives you a sorted output as a side effect. Personally I have used np.unique for integer data far more than any other type, and I would _guess_ it is similar in the broader community (though I have no data or insight to support that)–so I think having a big speedup like this could be quite nice for users.
Thanks, Miles _______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-leave@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: ralf.gommers@googlemail.com