On Wed, Jun 29, 2022 at 2:59 PM Miles Cranmer <miles.cranmer@gmail.com> wrote:
So, this new method is in fact a hash table as discussed in that blog post. However, because it assumes integer arrays, we can go even further than that blog and simply use `np.arange(ar_min, ar_max + 1)` as the "hash table". Thus, you don't actually need a hashing function at all: the integer itself serves as its own hash, and you get to use fast array operations. This is why it happens to return a sorted output, even though that's not technically necessary. The scaling is the same, but the overall constant in front of O(n) is much lower.
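
For concreteness, here is a minimal sketch of that idea, assuming a reasonably small integer range (this is illustrative only, not the PR's actual implementation, and `unique_ints_sketch` is just a placeholder name):

```python
import numpy as np

def unique_ints_sketch(ar):
    # Sketch: use the integer values themselves as indices into a boolean
    # table spanning [ar_min, ar_max], instead of hashing each element.
    ar = np.asarray(ar)
    ar_min, ar_max = ar.min(), ar.max()
    table = np.zeros(ar_max - ar_min + 1, dtype=bool)
    table[ar - ar_min] = True                     # mark which values occur
    return np.arange(ar_min, ar_max + 1)[table]   # sorted output comes for free

unique_ints_sketch([7, 3, 3, 9, 7])  # -> array([3, 7, 9])
```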

In other words, returning a sorted array for integers gets you the best performance, since you can use ``hash(integer) = integer`` (assuming the range of your integers is relatively small). But for other data types like strings, as discussed in that blog post, ``hash(string)`` can have a very large range over your strings, and it would be inefficient to allocate an array spanning that entire range.
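
As a point of contrast, a hash-table-based unique for arbitrary hashable types might look like the following sketch (again illustrative names only, not code from the blog or the PR); note that the table only grows with the number of distinct values, not with the range of the hash, and the output is no longer sorted:

```python
def unique_hashable_sketch(values):
    # Generic-type analogue: rely on hash(value) via a Python set,
    # storing only the distinct values actually seen.
    seen = set()
    out = []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out

unique_hashable_sketch(["b", "a", "b", "c"])  # -> ['b', 'a', 'c'] (unsorted)
```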

The closest analogy to this comparison would be counting sorts vs radix sorts. A radix sort (https://en.wikipedia.org/wiki/Radix_sort) buckets elements by key digit (similar in spirit to a hash table), whereas a counting sort (https://en.wikipedia.org/wiki/Counting_sort) pre-allocates an array whose length equals the range of the integer data. A radix sort scales as O(n*k), for k the key size, and a counting sort scales as O(n+k), for k the range of your data. However, a counting sort might have a better constant in front of the scaling, for fixed k, since it avoids the bucketing machinery and gets to use very efficient array operations. This PR is analogous to a counting sort, but a radix-sort-like method is described in that blog (and would be useful for generic types).
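
To make the counting-sort side of the analogy concrete, a NumPy sketch (illustrative only) of its O(n+k) structure might be:

```python
import numpy as np

def counting_sort_sketch(ar):
    # Counting sort: O(n + k), where k = ar_max - ar_min + 1.
    # Allocate one counter per possible value, then expand the counts.
    ar = np.asarray(ar)
    ar_min = ar.min()
    counts = np.bincount(ar - ar_min)                           # O(n + k)
    return np.repeat(np.arange(len(counts)) + ar_min, counts)   # O(n + k)

counting_sort_sketch([5, 2, 9, 2, 5])  # -> array([2, 2, 5, 5, 9])
```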

Does this help explain things?

That is indeed helpful, thanks Miles!

Cheers,
Ralf