[Numpy-discussion] Implementing a "find first" style function

Fri Mar 8 12:38:23 EST 2013

Interesting. I hadn't thought of those. I've implemented (very roughly
without a sound logic check) and benchmarked:

def my_any(a, predicate, chunk_size=2048):
    try:
        next(find(a, predicate, chunk_size))
        return True
    except StopIteration:
        return False

def my_all(a, predicate, chunk_size=2048):
    return not my_any(a, lambda a: ~predicate(a), chunk_size)

With the following setup:

import numpy as np
import numpy.random

np.random.seed(1)
a = np.random.randn(1e8)

For a low frequency *any*:

In [12]: %timeit (np.abs(a) > 6).any()
1 loops, best of 3: 1.29 s per loop

In [13]: %timeit my_any(a, lambda a: np.abs(a) > 6)
1 loops, best of 3: 792 ms per loop

In [14]: %timeit my_any(a, lambda a: np.abs(a) > 6, chunk_size=10000)
1 loops, best of 3: 654 ms per loop

For a False *any*:

In [16]: %timeit (np.abs(a) > 7).any()
1 loops, best of 3: 1.22 s per loop

In [17]: %timeit my_any(a, lambda a: np.abs(a) > 7)
1 loops, best of 3: 2.4 s per loop

For a high probability *any*:

In [28]: %timeit (np.abs(a) > 1).any()
1 loops, best of 3: 972 ms per loop

In [27]: %timeit my_any(a, lambda a: np.abs(a) > 1)
10000 loops, best of 3: 67 us per loop

---------------

For a low probability *all*:

In [18]: %timeit (np.abs(a) < 6).all()
1 loops, best of 3: 1.16 s per loop

In [19]: %timeit my_all(a, lambda a: np.abs(a) < 6)
1 loops, best of 3: 880 ms per loop

In [20]: %timeit my_all(a, lambda a: np.abs(a) < 6, chunk_size=10000)
1 loops, best of 3: 706 ms per loop

For a True *all*:

In [22]: %timeit (np.abs(a) < 7).all()
1 loops, best of 3: 1.47 s per loop

In [23]: %timeit my_all(a, lambda a: np.abs(a) < 7)
1 loops, best of 3: 2.65 s per loop

For a high probability *all*:

In [25]: %timeit (np.abs(a) < 1).all()
1 loops, best of 3: 978 ms per loop

In [26]: %timeit my_all(a, lambda a: np.abs(a) < 1)
10000 loops, best of 3: 73.6 us per loop

On 6 March 2013 21:16, Benjamin Root <ben.root at ou.edu> wrote:

>
>
> On Tue, Mar 5, 2013 at 9:15 AM, Phil Elson <pelson.pub at gmail.com> wrote:
>
>> The ticket https://github.com/numpy/numpy/issues/2269 discusses the
>> possibility of implementing a "find first" style function which can
>> optimise the process of finding the first value(s) which match a predicate
>> in a given 1D array. For example:
>>
>>
>> >>> a = np.sin(np.linspace(0, np.pi, 200))
>> >>> print find_first(a, lambda a: a > 0.9)
>> ((71, ), 0.900479032457)
>>
>>
>> This has been discussed in several locations:
>>
>> https://github.com/numpy/numpy/issues/2269
>> https://github.com/numpy/numpy/issues/2333
>>
>> http://stackoverflow.com/questions/7632963/numpy-array-how-to-find-index-of-first-occurrence-of-item
>>
>>
>> *Rationale*
>>
>> For small arrays there is no real reason to avoid doing:
>>
>> >>> a = np.sin(np.linspace(0, np.pi, 200))
>> >>> ind = (a > 0.9).nonzero()[0][0]
>> >>> print (ind, ), a[ind]
>> (71,) 0.900479032457
>>
>>
>> But for larger arrays, this can lead to massive amounts of work even if
>> the result is one of the first to be computed. Example:
>>
>> >>> a = np.arange(1e8)
>> >>> print (a == 5).nonzero()[0][0]
>> 5
>>
>>
>> So a function which terminates when the first matching value is found is
>> desirable.
>>
>> As mentioned in #2269, it is possible to define a consistent ordering
>> which allows this functionality for >1D arrays, but IMHO it overcomplicates
>> the problem and was not a case that I personally needed, so I've limited
>> the scope to 1D arrays only.
>>
>>
>> *Implementation*
>>
>> My initial assumption was that to get any kind of performance I would
>> need to write the *find* function in C, however after prototyping with
>> some array chunking it became apparent that a trivial python function would
>> be quick enough for my needs.
>>
>> The approach I've implemented in the code found in #2269 simply breaks
>> the array into sub-arrays of maximum length *chunk_size* (2048 by
>> default, though there is no real science to this number), applies the given
>> predicating function, and yields the results from *nonzero()*. The given
>> function should be a python function which operates on the whole of the
>> sub-array element-wise (i.e. the function should be vectorized). Returning
>> a generator also has the benefit of allowing users to get the first *n*matching values/indices.
>>
>>
>> *Results*
>>
>>
>> I timed the implementation of *find* found in my comment at
>> https://github.com/numpy/numpy/issues/2269#issuecomment-14436725 with an
>> obvious test:
>>
>>
>> In [1]: from np_utils import find
>>
>> In [2]: import numpy as np
>>
>> In [3]: import numpy.random
>>
>> In [4]: np.random.seed(1)
>>
>> In [5]: a = np.random.randn(1e8)
>>
>> In [6]: a.min(), a.max()
>> Out[6]: (-6.1194900990552776, 5.9632246301166321)
>>
>> In [7]: next(find(a, lambda a: np.abs(a) > 6))
>> Out[7]: ((33105441,), -6.1194900990552776)
>>
>> In [8]: (np.abs(a) > 6).nonzero()
>> Out[8]: (array([33105441]),)
>>
>> In [9]: %timeit (np.abs(a) > 6).nonzero()
>> 1 loops, best of 3: 1.51 s per loop
>>
>> In [10]: %timeit next(find(a, lambda a: np.abs(a) > 6))
>> 1 loops, best of 3: 912 ms per loop
>>
>> In [11]: %timeit next(find(a, lambda a: np.abs(a) > 6, chunk_size=100000))
>> 1 loops, best of 3: 470 ms per loop
>>
>> In [12]: %timeit next(find(a, lambda a: np.abs(a) > 6,
>> chunk_size=1000000))
>> 1 loops, best of 3: 483 ms per loop
>>
>>
>> This shows that picking a sensible *chunk_size* can yield massive
>> speed-ups (nonzero is x3 slower in one case). A similar example with a much
>> smaller 1D array shows similar promise:
>>
>> In [41]: a = np.random.randn(1e4)
>>
>> In [42]: %timeit next(find(a, lambda a: np.abs(a) > 3))
>> 10000 loops, best of 3: 35.8 us per loop
>>
>> In [43]: %timeit (np.abs(a) > 3).nonzero()
>> 10000 loops, best of 3: 148 us per loop
>>
>>
>> As I commented on the issue tracker, if you think this function is worth
>> taking forward, I'd be happy to open up a pull request.
>>
>> Feedback greatfully received.
>>
>> Cheers,
>>
>> Phil
>>
>>
>>
> In the interest of generalizing code and such, could such approaches be
> used for functions like np.any() and np.all() for short-circuiting if True
> or False (respectively) are found?  I wonder what other sort of functions
> in NumPy might benefit from this?
>
> Ben Root
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20130308/73b041e0/attachment.html>