As always, when optimising you must profile. For A of size 3000000 and k of
size 30, this is what I get:


89% of the time is being spent on this line (and it gets worse as you
increase the size of A):

indicesClosest = final_indices[inv_idx_sort]

I don't know of a faster way of doing this off the top of my head, so can
you somehow adapt your problem to work with the sorted indices of A instead?
That would avoid this reordering step entirely. Failing that, you can very
easily rewrite this line as an explicit loop in Cython; I think you can shave
off a bit of time there.
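
To make the two suggestions concrete, here is a minimal sketch. It assumes that inv_idx_sort is the inverse of an argsort permutation of A, which the line above does not confirm; idx_sort, A_sorted, gather_loop and the values in final_indices are hypothetical stand-ins, not your actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random(1_000)                  # small stand-in for the real A

idx_sort = np.argsort(A)               # permutation that sorts A
A_sorted = A[idx_sort]

# Assumption: inv_idx_sort is the inverse permutation of idx_sort.
inv_idx_sort = np.empty_like(idx_sort)
inv_idx_sort[idx_sort] = np.arange(A.size)

# Placeholder results, one per element of A_sorted (hypothetical values).
final_indices = rng.integers(0, 30, size=A.size)

# The hot line: a gather that maps results back to A's original order.
indicesClosest = final_indices[inv_idx_sort]

# Option 1: stay in sorted order. final_indices[j] already belongs to
# A_sorted[j], so no gather is needed if later steps can consume the
# pair (A_sorted, final_indices) directly.

# Option 2: the explicit loop you would type and compile in Cython.
def gather_loop(values, index):
    """Element-by-element equivalent of values[index]."""
    out = np.empty(index.shape[0], dtype=values.dtype)
    for i in range(index.shape[0]):
        out[i] = values[index[i]]
    return out

loop_result = gather_loop(final_indices, inv_idx_sort)
```

As plain Python the loop is much slower than the NumPy indexing; it is only worth trying once it is compiled with Cython and typed, and you should profile again to check that it actually helps.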

Here is a good tutorial for NumPy:
