[Numpy-discussion] Proposal: np.search() to complement np.searchsorted()

Mon May 15 22:02:35 EDT 2017

> From: Stephan Hoyer <shoyer at gmail.com>
>
> I like the idea of a strategy keyword argument. strategy='auto' leaves the
> door open for future improvements, e.g., if we ever add hash tables to
> numpy.
>
> For the algorithm, I think we actually want to sort the needles array as
> well in most (all?) cases.
>
> If haystack is also sorted, advancing thorough both arrays at once brings
> down the cost of the actual search itself down to O(n+k). (Possibly this is
> worth exposing as np.searchbothsorted or something similar?)

I actually suggest reducing the scope of the problem to

>>> search_unique(haystack, needles)

which finds the index* in a haystack (of unique values) of each needle
and can make the assumption (e.g. as Stephen Hoyer points out) that if
len(needles)<<len(haystack), then you can just sort the needles and
iterate through the haystack in O(H log(N)) steps (otherwise you sort
the haystack and iterate through the needles instead).

Whilst you could make this function work in the more general
non-unique case (i.e. using stable sorts and finding the first
occurrence) I notice this often an accident and users would be better
served with an exception. I also prefer raising an exception if a
needle is unfound (perhaps with an unfound_index=None kwarg).

Finally I am a little wary of strategy=‘hashtable’ (though I love
hash-tables!) since choosing the one hash that is great for everyone’s
data does not seem an easy problem.

Peter

* ‘index_’ sounds more attractive than ‘search_’, but we’re stuck with
the python connotation it would be the first occurrence and the numpy
connotation of that you are giving rather than receiving indices.