take from structured array is faster than boolean indexing, but reshapes columns to 2D
Dear all
Structured arrays are great, but I am having problems filtering them efficiently. Reading through the mailing list, it seems that boolean indexing is the recommended approach for filtering arrays on arbitrary conditions, but my testing shows that a combination of take and where can be much faster when dealing with structured arrays:
import timeit
setup = "from numpy import random, where, zeros; r = random.random_integers(1e3, size=1e6); q = zeros((1e6), dtype=[('foo', 'u4'), ('bar', 'u4'), ('baz', 'u4')]); q['foo'] = r"
statement1 = "s = q.take(where(q['foo'] < 500))"
statement2 = "s = q[q['foo'] < 500]"
t = timeit.Timer(statement1, setup)
t.timeit(10)
t = timeit.Timer(statement2, setup)
t.timeit(10)
Using the boolean array is about 4 times slower when dealing with large arrays. In my case, these operations are supposed to happen on a web server with a large number of requests, so the efficiency gain is important.
However, the combination of take and where reshapes the columns of structured arrays to be 2-dimensional:
>>> q['foo'].shape
(1000000,)
>>> s = q[q['foo'] < 500]
>>> s['foo'].shape
(499102,)
>>> s = q.take(where(q['foo'] < 500))
>>> s['foo'].shape
(1, 499102)
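(For what it's worth, a sketch of what I think is going on: where(condition) returns a *tuple* of index arrays, one per dimension, and passing that whole tuple to take makes numpy treat it as a (1, N) index array. Unpacking the first element of the tuple keeps everything 1-dimensional. The array size below is shrunk just for illustration.)

```python
import numpy as np

# Small stand-in for the structured array from the benchmark above.
n = 10_000
q = np.zeros(n, dtype=[('foo', 'u4'), ('bar', 'u4'), ('baz', 'u4')])
q['foo'] = np.random.randint(1, 1001, size=n)

idx = np.where(q['foo'] < 500)   # a tuple: (array_of_indices,)

s_2d = q.take(idx)               # tuple treated as a (1, N) index -> 2-D result
s_1d = q.take(idx[0])            # plain 1-D index array -> 1-D result

assert s_2d['foo'].ndim == 2
assert s_1d['foo'].ndim == 1
```

np.flatnonzero(q['foo'] < 500) should be equivalent to where(...)[0] here, if that reads better.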
Is there a way to use this seemingly more efficient approach (take & where) without having to manually reshape the columns? This seems ungainly for larger structured arrays. Or should I file this as a bug? Perhaps there are even more efficient approaches that I haven't thought of, but are obvious to others?
Thanks in advance,
Yours, Chris