[Numpy-discussion] NEP for faster ufuncs

Mark Wiebe mwwiebe at gmail.com
Tue Jan 4 16:37:01 EST 2011

On Tue, Jan 4, 2011 at 4:34 AM, David Cournapeau <cournape at gmail.com> wrote:

> Ok, I took some time to look into it, but I am far from understanding
> everything yet. I will need more time.

Yeah, it ended up being pretty large.  I think the UFunc code will shrink
substantially when it uses this iterator, which is something I was
planning to tackle next.
> One design issue which bothers me a bit is the dynamically created
> structure for the iterator - do you have some benchmarks which show
> that this design is significantly better than a plain old C data
> structure with a couple of dynamically allocated arrays ? Besides
> bypassing the compiler type checks, I am a bit worried about the
> ability to extend the iterator through "inheritance in C" like I did
> with the neighborhood iterator, but maybe I should just try it.

I know what you mean - if I could use C++ templates, the implementation
could probably have the best of both worlds, but since NumPy is in C, I
compromised mostly toward higher performance.  I don't have benchmarks
showing that this layout is faster than separately allocated arrays, but I
did verify that the compiler performs the optimizations I want.  For
example, the specialized iternext function for 1 operand and 1 dimension, a
common case because of dimension coalescing, looks like this on my machine:

       0: 48 83 47 58 01       addq   $0x1,0x58(%rdi)
       5: 48 8b 47 60           mov    0x60(%rdi),%rax
       9: 48 01 47 68           add    %rax,0x68(%rdi)
       d: 48 8b 47 50           mov    0x50(%rdi),%rax
      11: 48 39 47 58           cmp    %rax,0x58(%rdi)
      15: 0f 9c c0             setl   %al
      18: 0f b6 c0             movzbl %al,%eax
      1b: c3                   retq

The function has no branches, and all memory accesses are directly offset
from the iterator pointer %rdi, which I think is pretty good.  If this data
were in separately allocated arrays, I think it would hurt locality and add
extra instructions.

In the implementation, I tried to structure the data access macros so that
errors are easy to spot.  Access to the bufferdata and the axisdata isn't
typed, but I can think of ways to do that.  I was treating the
implementation as fully opaque to any non-iterator code, even within
NumPy - do you think such access will be necessary?

> I think the code would benefit from smaller functions, too - 500+-line
> functions are just too much IMO, they should be split up.

I definitely agree.  I've been splitting things up as they got large, but
that's not finished yet.  I also think the main iterator .c file is too
large and needs splitting up.

> To get a deeper understanding of the code, I am starting to implement
> several benchmarks to compare old and new iterator - do you already
> have some of them handy ?

So far I've only done timing through the Python exposure, so C-based
benchmarking is welcome.  Where possible, NPY_ITER_NO_INNER_ITERATION
should be used, since it allows longer inner loops with no function calls.
An example where this is not possible is when coordinates are required.  I
should probably put together a collection of copy/paste templates for
typical uses.

> Thanks for the hard work, that's a really nice piece of code,

Thanks for taking the time to look into it,

