[Numpy-discussion] Improving NumPy's indexing / subsetting / fancy indexing implementation

Fri Apr 6 06:00:11 EDT 2012

Hi Wes,

I believe that Mark rewrote a bunch of the fancy-indexing-related code
from scratch in the masked-NA branch. I don't know if it affects
anything you're talking about here, but just as a heads up, you might
want to benchmark master, since it may have a different performance
profile.

-- Nathaniel

On Fri, Apr 6, 2012 at 4:04 AM, Wes McKinney <wesmckinn at gmail.com> wrote:
> dear all,
>
> I've routinely found that:
>
> 1) ndarray.take is up to 1 order of magnitude faster than fancy indexing
> 2) Hand-coded Cython boolean indexing is many times faster than ndarray indexing
> 3) putmask is significantly faster than ndarray indexing
>
> For example, I  stumbled on this tonight:
>
> straightforward cython code:
>
> def row_bool_subset(ndarray[float64_t, ndim=2] values,
>                   ndarray[uint8_t, cast=True] mask):
>   cdef:
>       Py_ssize_t i, j, n, k, pos = 0
>       ndarray[float64_t, ndim=2] out
>
>   n, k = (<object> values).shape
>   assert(n == len(mask))
>
>   out = np.empty((mask.sum(), k), dtype=np.float64)
>
>   for i in range(n):
>       if mask[i]:
>           for j in range(k):
>               out[pos, j] = values[i, j]
>           pos += 1
>
>   return out
>
> In [1]: values = randn(1000000, 4)
>
> In [2]: mask = np.ones(1000000, dtype=bool)
>
> In [3]: import pandas._sandbox as sbx
>
> In [4]: result = sbx.row_bool_subset(values, mask)
>
> In [5]: timeit result = sbx.row_bool_subset(values, mask)
> 100 loops, best of 3: 9.58 ms per loop
>
> pure NumPy:
>
> In [6]: timeit values[mask]
> 10 loops, best of 3: 81.6 ms per loop
>
> Here's the kind of take performance problems that I routinely experience:
>
> In [12]: values = randn(1000000, 4)
> v
> In [13]: values.shape
> Out[13]: (1000000, 4)
>
> In [14]: indexer = np.random.permutation(1000000)[:500000]
>
> In [15]: timeit values[indexer]
> 10 loops, best of 3: 70.7 ms per loop
>
> In [16]: timeit values.take(indexer, axis=0)
> 100 loops, best of 3: 13.3 ms per loop
>
> When I can spare the time in the future I will personally work on
> these indexing routines in the C codebase, but I suspect that I'm not
> the only one adversely affected by these kinds of performance
> problems, and it might be worth thinking about a community effort to
> split up the work of retooling these methods to be more performant. We
> could use a tool like my vbench project (github.com/wesm/vbench) to
> make a list of the performance benchmarks of interest (like the ones
> above).
>
> Unfortunately I am too time constrained at least for the next 6 months
> to devote to a complete rewrite of the code in question. Possibly
> sometime in 2013 if no one has gotten to it yet, but this seems like
> someplace that we should be concerned about as the language
> performance wars continue to rage.
>
> - Wes
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion