[Numpy-discussion] Proposal: np.search() to complement np.searchsorted()

Tue May 16 05:31:52 EDT 2017

On Tue, May 16, 2017 at 1:25 AM, Stephan Hoyer <shoyer at gmail.com> wrote:

> I like the idea of a strategy keyword argument. strategy='auto' leaves the
> door open for future improvements, e.g., if we ever add hash tables to
> numpy.
>
> For the algorithm, I think we actually want to sort the needles array as
> well in most (all?) cases.
>
> If haystack is also sorted, advancing thorough both arrays at once brings
> down the cost of the actual search itself down to O(n+k). (Possibly this is
> worth exposing as np.searchbothsorted or something similar?)
>
> If we don't sort haystack, we can still use binary search on the needles
> to bring the search cost down to O(n log k).
>

I have several old branches implementing most of this:

   - https://github.com/jaimefrio/numpy/tree/sorted-searchsorted This one
   adds a sorted_keys= kwarg to searchsorted that assumes the keys are also
   sorted and does the linear scan of both.
   - https://github.com/jaimefrio/numpy/tree/mergesorted Adds an internal
   mergesorted function that merges two sorted arrays into a new one, with
   several handlings of repeated items, originally intended to speed-up
   intersect1d and friends in arraysetops.
   - https://github.com/jaimefrio/numpy/tree/hash-unique Implements a hash
   table for numpy arrays, largely based on the implementation of Python
   dicts, and uses it for unique.

Some of the speed-ups are considerable, but I guess I found the extra
complexity didn't justify sending any of those as a PR.

But if we really want to make that search thing happen, then maybe it's
time to bring them back from the dead.

Jaime

> On Mon, May 15, 2017 at 5:00 PM Nathaniel Smith <njs at pobox.com> wrote:
>
>> On May 9, 2017 9:47 AM, "Martin Spacek" <numpy at mspacek.mm.st> wrote:
>>
>> Hello,
>>
>> I've opened up a pull request to add a function called np.search(), or
>> something like it, to complement np.searchsorted():
>>
>> https://github.com/numpy/numpy/pull/9055
>>
>> There's also this issue I opened before starting the PR:
>>
>> https://github.com/numpy/numpy/issues/9052
>>
>> Proposed API changes require discussion on the list, so here I am!
>>
>> This proposed function (and perhaps array method?) does the same as
>> np.searchsorted(a, v), but doesn't require `a` to be sorted, and explicitly
>> checks if all the values in `v` are a subset of those in `a`. If not, it
>> currently raises an error, but that could be controlled via a kwarg.
>>
>> As I mentioned in the PR, I often find myself abusing np.searchsorted()
>> by not explicitly checking these assumptions. The temptation to use it is
>> great, because it's such a fast and convenient function, and most of the
>> time that I use it, the assumptions are indeed valid. Explicitly checking
>> those assumptions each and every time before I use np.searchsorted() is
>> tedious, and easy to forget to do. I wouldn't be surprised if many others
>> abuse np.searchsorted() in the same way.
>>
>>
>> It's worth noting though that the "sorted" part is a critical part of
>> what makes it fast. If we're looking for k needles in an n-item haystack,
>> then:
>>
>> If the haystack is already sorted and we know it, using searchsorted does
>> it in k*log2(n) comparisons. (Could be reduced to average case O(k log log
>> n) for simple scalars by using interpolation search, but I don't think
>> searchsorted is that clever atm.)
>>
>> If the haystack is not sorted, then sorting it and then using
>> searchsorted requires a total of O(n log n) + k*log2(n) comparisons.
>>
>> And if the haystack is not sorted, then doing linear search to find the
>> first item like list.index does requires on average 0.5*k*n comparisons.
>>
>> This analysis ignores memory effects, which are important -- linear
>> memory access is faster than random access, and searchsorted is all about
>> making memory access maximally unpredictable. But even so, I think
>> sorting-then-searching will be reasonably competitive pretty much from the
>> start, and for moderately large k and n values the difference between (n +
>> k)*log(n) and n*k is huge.
>>
>> Another issue is that sorting requires an O(n)-sized temporary buffer
>> (assuming you can't mutate the haystack in place). But if your haystack is
>> a large enough fraction of memory that you can't afford is buffer, then
>> it's likely large enough that you can't afford linear searching either...
>>
>>
>> Looking at my own habits and uses, it seems to me that finding the
>> indices of matching values of one array in another is a more common use
>> case than finding insertion indices of one array into another sorted array.
>> So, I propose that np.search(), or something like it, could be even more
>> useful than np.searchsorted().
>>
>>
>> My main concern here would be creating a trap for the unwary, where
>> people use search() naively because it's so nice and convenient, and then
>> eventually get surprised by a nasty quadratic slowdown. There's a whole
>> blog about these traps :-) https://accidentallyquadratic.tumblr.com/
>>
>> Otoh there are also huge number of numpy use cases where it doesn't
>> matter if some calculation is 1000x slower than it should be, as long as it
>> works and is discoverable...
>>
>> So it sounds like one obvious thing would be to have a version of
>> searchsorted that checks for matches (maybe side="exact"? Though that's not
>> easy to find...). That's clearly useful, and orthogonal to the
>> linear/binary search issue, so we shouldn't make it a reason people are
>> tempted to choose the inferior algorithm.
>>
>> ...ok, how's this for a suggestion. Give np.search a strategy= kwarg,
>> with options "linear", "searchsorted", and "auto". Linear does the obvious
>> thing, searchsorted generates a sorter array using argsort (unless the user
>> provided one) and then calls searchsorted, and auto picks one of them
>> depending on whether a sorter array was provided and how large the arrays
>> are. The default is auto. In all cases it looks for exact matches.
>>
>> I guess by default "not found" should be signaled with an exception, and
>> then there should be some option to have it return a sentinel value
>> instead? The problem is that since we're returning integers then there's no
>> sentinel value that's necessarily an invalid index (e.g. indexing will
>> happily accept -1).
>>
>> -n
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>

-- 
(\__/)
( O.o)
( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20170516/b8b7d154/attachment-0001.html>