[Numpy-discussion] Boolean arrays

Fri Aug 27 16:35:07 EDT 2010

On Fri, Aug 27, 2010 at 15:21, Nathaniel Smith <njs at pobox.com> wrote:
> On Fri, Aug 27, 2010 at 1:17 PM, Robert Kern <robert.kern at gmail.com> wrote:
>> But in any case, that would be very slow for large arrays since it
>> would invoke a Python function call for every value in ar. Instead,
>> iterate over the valid array, which is much shorter:
>>
>> mask = np.zeros(ar.shape, dtype=bool)
>> for good in valid:
>>    mask |= (ar == good)
>>
>> Wrap that up into a function and you're good to go. That's about as
>> efficient as it gets unless the valid array gets large.
>
> Probably even more efficient if 'ar' is large and 'valid' is small,
> and shorter to boot:
>
> np.in1d(ar, valid)

Not according to my timings:

[~]
|2> def kern_in(x, valid):
..>     mask = np.zeros(x.shape, dtype=bool)
..>     for good in valid:
..>         mask |= (x == good)
..>     return mask
..>

[~]
|6> ar = np.random.randint(100, size=1000000)

[~]
|7> valid = np.arange(0, 100, 5)

[~]
|8> %timeit kern_in(ar, valid)
10 loops, best of 3: 115 ms per loop

[~]
|9> %timeit np.in1d(ar, valid)
1 loops, best of 3: 279 ms per loop

As valid gets larger, in1d() will catch up, but for smallish sizes of
valid, which I suspect is the case here given the "non-numeric" nature
of the OP's (Hi, Brett!) request, kern_in() is usually better.
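
A minimal sketch for rerunning this comparison outside IPython and at
several sizes of valid. It assumes a NumPy recent enough to provide
np.isin (the newer spelling of np.in1d) and np.random.default_rng. The
compare() helper, the seed, and the chosen sizes are illustrative and
not part of the timings above; the numbers will differ by machine and
NumPy version.

```python
import timeit

import numpy as np


def kern_in(x, valid):
    """Return a boolean mask marking elements of x that equal any value in valid."""
    mask = np.zeros(x.shape, dtype=bool)
    # One vectorized comparison per valid value; the Python-level loop
    # runs len(valid) times, not len(x) times.
    for good in valid:
        mask |= (x == good)
    return mask


def compare(ar, valid, number=3):
    """Check that both approaches agree, then return (loop_time, isin_time) per call."""
    assert np.array_equal(kern_in(ar, valid), np.isin(ar, valid))
    t_loop = timeit.timeit(lambda: kern_in(ar, valid), number=number) / number
    t_isin = timeit.timeit(lambda: np.isin(ar, valid), number=number) / number
    return t_loop, t_isin


rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
ar = rng.integers(100, size=1_000_000)

# Illustrative sizes of valid: 5, 20 (as in the session above), and 100 values.
for step in (20, 5, 1):
    valid = np.arange(0, 100, step)
    t_loop, t_isin = compare(ar, valid)
    print(f"len(valid)={len(valid):3d}  kern_in: {t_loop * 1e3:8.1f} ms  "
          f"np.isin: {t_isin * 1e3:8.1f} ms")
```

The cost of the loop grows roughly linearly with len(valid), so the
crossover point, if there is one on a given setup, shows up as valid
gets longer.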

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco