On Wed, 2022-06-29 at 12:56 +0000, Miles Cranmer wrote:
So, this new method is in fact a hash table as discussed in that blog post. However, because it assumes integer arrays, we can go even further than that blog, and simply use `np.arange(ar_min, ar_max + 1)` is the "hash table". Thus, you don't actually need to use a hashing function at all, you can just use the integer itself as the hash of an integer, and use fast array operations. This is why it happens to return a sorted output, even though it's not technically necessary. The scaling is the same, but the overall constant in front of O(n) is much lower.
Yeah, which is why I was happy with adding `kind=`. There was some support (and no-one against it) and it seemed unlikely that there is a "generic" solution that is can just be a great default in all cases. (If there is, deprecating `kind` is probably not terrible either.) Whether the `kind=` should cover methods that do not sort the output here, I am not sure. Maybe that should be its own kwarg. (Kinds that do not return a sorted result can still sort the result and hope it is much smaller). In general, I am happy with the addition though. The only larger argument against it that I can think of, would be if we promoted another library that does the job much better. Cheers, Sebastian
In other words, returning a sorted array for integers gets you the best performance since you can use ``hash(integer)=integer`` (+ assume that your range of integers is relatively small). But for other datatypes like strings, as discussed on that blog post, ``hash(string)`` might have a large range over your strings, and it would be inefficient to allocate the entire thing.
The closest analogy to this comparison would be counting sorts vs radix sorts. A radix sort (https://en.wikipedia.org/wiki/Radix_sort) uses a hash table, whereas a counting sort (https://en.wikipedia.org/wiki/Counting_sort) pre-allocates an array of length equal to the range of integer data. A radix sort has scaling O(n*k), for k the key size, and a counting sort has scaling O(n+k), for k the range of your data. However, a counting sort might have a better constant in front of the scaling, for fixed k, since it avoids having to create the hash table, and gets to use super efficient array operations. This PR is analogous to a counting sort, but a radix sort-like method is described on that blog (and would be useful for generic types).
Does this help explain things? Cheers, Miles _______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-leave@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: sebastian@sipsolutions.net