[Numpy-discussion] Implementing a "find first" style function

Tue Mar 5 09:15:07 EST 2013

The ticket https://github.com/numpy/numpy/issues/2269 discusses the
possibility of implementing a "find first" style function which can
optimise the process of finding the first value(s) which match a predicate
in a given 1D array. For example:

>>> a = np.sin(np.linspace(0, np.pi, 200))
>>> print find_first(a, lambda a: a > 0.9)
((71, ), 0.900479032457)

This has been discussed in several locations:

https://github.com/numpy/numpy/issues/2269
https://github.com/numpy/numpy/issues/2333
http://stackoverflow.com/questions/7632963/numpy-array-how-to-find-index-of-first-occurrence-of-item

*Rationale*

For small arrays there is no real reason to avoid doing:

>>> a = np.sin(np.linspace(0, np.pi, 200))
>>> ind = (a > 0.9).nonzero()[0][0]
>>> print (ind, ), a[ind]
(71,) 0.900479032457

But for larger arrays, this can lead to massive amounts of work even if the
result is one of the first to be computed. Example:

>>> a = np.arange(1e8)
>>> print (a == 5).nonzero()[0][0]
5

So a function which terminates when the first matching value is found is
desirable.

As mentioned in #2269, it is possible to define a consistent ordering which
allows this functionality for >1D arrays, but IMHO it overcomplicates the
problem and was not a case that I personally needed, so I've limited the
scope to 1D arrays only.

*Implementation*

My initial assumption was that to get any kind of performance I would need
to write the *find* function in C, however after prototyping with some
array chunking it became apparent that a trivial python function would be
quick enough for my needs.

The approach I've implemented in the code found in #2269 simply breaks the
array into sub-arrays of maximum length *chunk_size* (2048 by default,
though there is no real science to this number), applies the given
predicating function, and yields the results from *nonzero()*. The given
function should be a python function which operates on the whole of the
sub-array element-wise (i.e. the function should be vectorized). Returning
a generator also has the benefit of allowing users to get the first
*n*matching values/indices.

*Results*

I timed the implementation of *find* found in my comment at
https://github.com/numpy/numpy/issues/2269#issuecomment-14436725 with an
obvious test:

In [1]: from np_utils import find

In [2]: import numpy as np

In [3]: import numpy.random

In [4]: np.random.seed(1)

In [5]: a = np.random.randn(1e8)

In [6]: a.min(), a.max()
Out[6]: (-6.1194900990552776, 5.9632246301166321)

In [7]: next(find(a, lambda a: np.abs(a) > 6))
Out[7]: ((33105441,), -6.1194900990552776)

In [8]: (np.abs(a) > 6).nonzero()
Out[8]: (array([33105441]),)

In [9]: %timeit (np.abs(a) > 6).nonzero()
1 loops, best of 3: 1.51 s per loop

In [10]: %timeit next(find(a, lambda a: np.abs(a) > 6))
1 loops, best of 3: 912 ms per loop

In [11]: %timeit next(find(a, lambda a: np.abs(a) > 6, chunk_size=100000))
1 loops, best of 3: 470 ms per loop

In [12]: %timeit next(find(a, lambda a: np.abs(a) > 6, chunk_size=1000000))
1 loops, best of 3: 483 ms per loop

This shows that picking a sensible *chunk_size* can yield massive speed-ups
(nonzero is x3 slower in one case). A similar example with a much smaller
1D array shows similar promise:

In [41]: a = np.random.randn(1e4)

In [42]: %timeit next(find(a, lambda a: np.abs(a) > 3))
10000 loops, best of 3: 35.8 us per loop

In [43]: %timeit (np.abs(a) > 3).nonzero()
10000 loops, best of 3: 148 us per loop

As I commented on the issue tracker, if you think this function is worth
taking forward, I'd be happy to open up a pull request.

Feedback greatfully received.

Cheers,

Phil
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20130305/5fb42bc1/attachment.html>