[Numpy-discussion] NEP for faster ufuncs
mwwiebe at gmail.com
Tue Jan 4 16:37:01 EST 2011
On Tue, Jan 4, 2011 at 4:34 AM, David Cournapeau <cournape at gmail.com> wrote:
> Ok, I took some time to look into it, but I am far from understanding
> everything yet. I will need more time.
Yeah, it ended up being pretty large. I think the UFunc code will shrink
substantially when it uses this iterator.
> One design issue which bothers me a bit is the dynamically created
> structure for the iterator - do you have some benchmarks which show
> that this design is significantly better than a plain old C data
> structure with a couple of dynamically allocated arrays ? Besides
> bypassing the compiler type checks, I am a bit worried about the
> ability to extend the iterator through "inheritence in C" like I did
> with neighborhood iterator, but maybe I should just try it.
I know what you mean - if I could use C++ templates, the implementation could
probably have the best of both worlds, but since NumPy is in C I compromised
mostly in favor of higher performance. I don't have benchmarks showing that
this design is faster, but I did validate that the compiler performs the
optimizations I want it to. For example, the specialized iternext function
for one operand and one dimension, a common case because of dimension
coalescing, looks like this on my machine:
0: 48 83 47 58 01 addq $0x1,0x58(%rdi)
5: 48 8b 47 60 mov 0x60(%rdi),%rax
9: 48 01 47 68 add %rax,0x68(%rdi)
d: 48 8b 47 50 mov 0x50(%rdi),%rax
11: 48 39 47 58 cmp %rax,0x58(%rdi)
15: 0f 9c c0 setl %al
18: 0f b6 c0 movzbl %al,%eax
1b: c3 retq
The function has no branches, and all memory accesses are directly offset
from the iter pointer %rdi, which I think is pretty good. If this data were
in separately allocated arrays, I think it would hurt locality as well as
add more instructions.
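As a rough illustration of why the disassembly comes out that way, here is a hypothetical C sketch (the struct layout and names are illustrative, not the actual NumPy iterator) of a specialized one-operand, one-dimension iternext: because every field lives at a fixed offset in a single allocation, the advance-and-test compiles to branch-free loads and stores off the iterator pointer, with the comparison result itself returned (the setl/movzbl pattern above):

```c
/* Hypothetical sketch of a specialized 1-op, 1-dim iterator; the field
 * names and layout are assumptions for illustration, not NumPy's. */
typedef struct {
    long size;     /* total number of iterations */
    long index;    /* current iteration index */
    long stride;   /* stride of the single operand, in bytes */
    char *dataptr; /* pointer into the operand's data */
} onedim_iter;

/* Advance the iterator; returns nonzero while iteration continues.
 * No branches: the index/size comparison is the return value. */
static int iternext_one_op_one_dim(onedim_iter *it)
{
    it->index += 1;
    it->dataptr += it->stride;
    return it->index < it->size;
}
```

With all fields in one allocation, the compiler can keep everything as fixed offsets from a single base register, which is what shows up in the listing above.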
In the implementation, I tried to structure the data access macros so that
errors are easy to spot. Access to the bufferdata and the axisdata isn't
typed, but I can think of ways to do that. I was treating the implementation
as fully opaque to any non-iterator code, even within NumPy - do you think
such access will be necessary?
> I think the code would benefit from smaller functions, too - 500+
> line functions are just too much IMO, they should be split up.
I definitely agree. I've been splitting things up as they got large, but
that work isn't finished. I also think the main iterator .c file is too
large and needs to be split up.
> To get a deeper understanding of the code, I am starting to implement
> several benchmarks to compare old and new iterator - do you already
> have some of them handy ?
So far I've only done timing through the Python exposure, so C-based
benchmarking is welcome. Where possible, NPY_ITER_NO_INNER_ITERATION should
be used, since it allows longer inner loops with no function calls. One
example where this is not possible is when coordinates are required. I
should probably put together a collection of copy/paste templates for
typical uses.
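To make the benefit concrete, here is a hypothetical sketch of the calling pattern that flag enables (the struct and function names are mine, not the NumPy API): instead of one iternext call per element, the iterator hands back a run of elements described by a pointer, a stride, and an inner size, and the caller executes a tight inner loop over that run with no function calls per element:

```c
#include <stddef.h>

/* Illustrative run-at-a-time iterator; names are assumptions, not
 * the actual NPY_ITER_NO_INNER_ITERATION interface. */
typedef struct {
    char *dataptr;     /* start of the current inner run */
    ptrdiff_t stride;  /* byte stride within the run */
    size_t inner_size; /* number of elements in the run */
    size_t runs_left;  /* runs remaining after the current one */
} run_iter;

/* Advance to the next run; returns nonzero while runs remain. */
static int run_iternext(run_iter *it)
{
    if (it->runs_left == 0)
        return 0;
    it->runs_left -= 1;
    return 1;
}

/* Sum a buffer of doubles one run at a time: the inner loop is
 * plain pointer arithmetic, with a function call only per run. */
static double sum_with_runs(run_iter *it)
{
    double total = 0.0;
    do {
        char *p = it->dataptr;
        size_t i;
        for (i = 0; i < it->inner_size; i++) {
            total += *(double *)p;
            p += it->stride;
        }
        it->dataptr = p; /* assume runs are laid out back to back */
    } while (run_iternext(it));
    return total;
}
```

The point of the pattern is that the per-element cost is just the loop body; the iternext overhead is amortized over the whole run, which is why longer inner loops help.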
> Thanks for the hard work, that's a really nice piece of code,
Thanks for taking the time to look into it,