[Numpy-discussion] Proposal: np.search() to complement np.searchsorted()

Mon May 15 19:25:32 EDT 2017

I like the idea of a strategy keyword argument. strategy='auto' leaves the
door open for future improvements, e.g., if we ever add hash tables to
numpy.

For the algorithm, I think we actually want to sort the needles array as
well in most (all?) cases.

If haystack is also sorted, advancing thorough both arrays at once brings
down the cost of the actual search itself down to O(n+k). (Possibly this is
worth exposing as np.searchbothsorted or something similar?)

If we don't sort haystack, we can still use binary search on the needles to
bring the search cost down to O(n log k).
On Mon, May 15, 2017 at 5:00 PM Nathaniel Smith <njs at pobox.com> wrote:

> On May 9, 2017 9:47 AM, "Martin Spacek" <numpy at mspacek.mm.st> wrote:
>
> Hello,
>
> I've opened up a pull request to add a function called np.search(), or
> something like it, to complement np.searchsorted():
>
> https://github.com/numpy/numpy/pull/9055
>
> There's also this issue I opened before starting the PR:
>
> https://github.com/numpy/numpy/issues/9052
>
> Proposed API changes require discussion on the list, so here I am!
>
> This proposed function (and perhaps array method?) does the same as
> np.searchsorted(a, v), but doesn't require `a` to be sorted, and explicitly
> checks if all the values in `v` are a subset of those in `a`. If not, it
> currently raises an error, but that could be controlled via a kwarg.
>
> As I mentioned in the PR, I often find myself abusing np.searchsorted() by
> not explicitly checking these assumptions. The temptation to use it is
> great, because it's such a fast and convenient function, and most of the
> time that I use it, the assumptions are indeed valid. Explicitly checking
> those assumptions each and every time before I use np.searchsorted() is
> tedious, and easy to forget to do. I wouldn't be surprised if many others
> abuse np.searchsorted() in the same way.
>
>
> It's worth noting though that the "sorted" part is a critical part of what
> makes it fast. If we're looking for k needles in an n-item haystack, then:
>
> If the haystack is already sorted and we know it, using searchsorted does
> it in k*log2(n) comparisons. (Could be reduced to average case O(k log log
> n) for simple scalars by using interpolation search, but I don't think
> searchsorted is that clever atm.)
>
> If the haystack is not sorted, then sorting it and then using searchsorted
> requires a total of O(n log n) + k*log2(n) comparisons.
>
> And if the haystack is not sorted, then doing linear search to find the
> first item like list.index does requires on average 0.5*k*n comparisons.
>
> This analysis ignores memory effects, which are important -- linear memory
> access is faster than random access, and searchsorted is all about making
> memory access maximally unpredictable. But even so, I think
> sorting-then-searching will be reasonably competitive pretty much from the
> start, and for moderately large k and n values the difference between (n +
> k)*log(n) and n*k is huge.
>
> Another issue is that sorting requires an O(n)-sized temporary buffer
> (assuming you can't mutate the haystack in place). But if your haystack is
> a large enough fraction of memory that you can't afford is buffer, then
> it's likely large enough that you can't afford linear searching either...
>
>
> Looking at my own habits and uses, it seems to me that finding the indices
> of matching values of one array in another is a more common use case than
> finding insertion indices of one array into another sorted array. So, I
> propose that np.search(), or something like it, could be even more useful
> than np.searchsorted().
>
>
> My main concern here would be creating a trap for the unwary, where people
> use search() naively because it's so nice and convenient, and then
> eventually get surprised by a nasty quadratic slowdown. There's a whole
> blog about these traps :-) https://accidentallyquadratic.tumblr.com/
>
> Otoh there are also huge number of numpy use cases where it doesn't matter
> if some calculation is 1000x slower than it should be, as long as it works
> and is discoverable...
>
> So it sounds like one obvious thing would be to have a version of
> searchsorted that checks for matches (maybe side="exact"? Though that's not
> easy to find...). That's clearly useful, and orthogonal to the
> linear/binary search issue, so we shouldn't make it a reason people are
> tempted to choose the inferior algorithm.
>
> ...ok, how's this for a suggestion. Give np.search a strategy= kwarg, with
> options "linear", "searchsorted", and "auto". Linear does the obvious
> thing, searchsorted generates a sorter array using argsort (unless the user
> provided one) and then calls searchsorted, and auto picks one of them
> depending on whether a sorter array was provided and how large the arrays
> are. The default is auto. In all cases it looks for exact matches.
>
> I guess by default "not found" should be signaled with an exception, and
> then there should be some option to have it return a sentinel value
> instead? The problem is that since we're returning integers then there's no
> sentinel value that's necessarily an invalid index (e.g. indexing will
> happily accept -1).
>
> -n
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20170515/4bbc73c5/attachment-0001.html>