Hello NumPy-ers,

After some performance analysis, I've designed and implemented a new iterator intended to speed up ufuncs and allow for easier multi-dimensional iteration. The new code is fairly large, but works quite well already. If some people could read the NEP and give some feedback, that would be great! Here's a link:

https://github.com/m-paradox/numpy/blob/mw_neps/doc/neps/new-iterator-ufunc.rst

I would also love it if someone could try building the code and play around with it a bit. The github branch is here:

https://github.com/m-paradox/numpy/tree/new_iterator

To give a taste of the iterator's functionality, below is an example from the NEP for how to implement a "Lambda UFunc." With just a few lines of code, it's possible to replicate something similar to the numexpr library (numexpr still gets a bigger speedup, though). In the example expression I chose, execution time went from 138ms to 61ms.

Hopefully this is a good Christmas present for NumPy. :)

Cheers,
Mark

Here is the definition of the ``luf`` function::

    def luf(lamdaexpr, *args, **kwargs):
        """Lambda UFunc

        e.g.
        c = luf(lambda i,j:i+j, a, b, order='K',
                casting='safe', buffersize=8192)

        c = np.empty(...)
        luf(lambda i,j:i+j, a, b, out=c, order='K',
            casting='safe', buffersize=8192)
        """
        nargs = len(args)
        op = args + (kwargs.get('out', None),)
        it = np.newiter(op,
                        ['buffered', 'no_inner_iteration'],
                        [['readonly', 'nbo_aligned']]*nargs +
                            [['writeonly', 'allocate', 'no_broadcast']],
                        order=kwargs.get('order', 'K'),
                        casting=kwargs.get('casting', 'safe'),
                        buffersize=kwargs.get('buffersize', 0))
        while not it.finished:
            it[-1] = lamdaexpr(*it[:-1])
            it.iternext()
        return it.operands[-1]

Then, by using ``luf`` instead of straight Python expressions, we can gain some performance from better cache behavior::

    In [2]: a = np.random.random((50,50,50,10))
    In [3]: b = np.random.random((50,50,1,10))
    In [4]: c = np.random.random((50,50,50,1))

    In [5]: timeit 3*a+b-(a/c)
    1 loops, best of 3: 138 ms per loop

    In [6]: timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
    10 loops, best of 3: 60.9 ms per loop

    In [7]: np.all(3*a+b-(a/c) == luf(lambda a,b,c:3*a+b-(a/c), a, b, c))
    Out[7]: True
That is an amazing christmas present. On Tue, Dec 21, 2010 at 4:53 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:
<snip>
I applaud you on your vision. I only have one small suggestion: put a table of contents at the beginning of your NEP so people can skip to the part that most interests them. On Tue, Dec 21, 2010 at 4:59 PM, John Salvatier <jsalvati@u.washington.edu> wrote:
That is an amazing christmas present.
<snip>
That's a good suggestion - added. Unfortunately, it looks like the github rst converter doesn't make a table of contents with working links.

Cheers,
Mark

On Tue, Dec 21, 2010 at 6:00 PM, John Salvatier <jsalvati@u.washington.edu> wrote:
I applaud you on your vision. I only have one small suggestion: I suggest you put a table of contents at the beginning of your NEP so people may skip to the part that most interests them.
<snip>
Hi Mark, On 12/22/2010 09:53 AM, Mark Wiebe wrote:
<snip>
This looks pretty cool. I hope to be able to take a look at it during the Christmas holidays. I cannot comment in detail yet, but it seems to address several issues I encountered myself while implementing the neighborhood iterator (which I will try to update to use the new one).

One question: which CPU/platform did you test it on?

cheers,

David
On Tue, Dec 21, 2010 at 6:06 PM, David <david@silveregg.co.jp> wrote:
<snip>
This looks pretty cool. I hope to be able to take a look at it during the Christmas holidays.
Thanks!
I cannot comment in detail yet, but it seems to address several issues I encountered myself while implementing the neighborhood iterator (which I will try to update to use the new one).
One question: which CPU/platform did you test it on?
The system I'm testing on is a bit old: a dual core Athlon 64 X2 4200+ on 64-bit Fedora 13, gcc 4.4.5. -Mark
On Wed, Dec 22, 2010 at 11:20 AM, Mark Wiebe <mwwiebe@gmail.com> wrote:
On Tue, Dec 21, 2010 at 6:06 PM, David <david@silveregg.co.jp> wrote:
<snip>
This looks pretty cool. I hope to be able to take a look at it during the Christmas holidays.
Thanks!
Ok, I took some time to look into it, but I am far from understanding everything yet. I will need more time.

One design issue which bothers me a bit is the dynamically created structure for the iterator - do you have some benchmarks which show that this design is significantly better than a plain old C data structure with a couple of dynamically allocated arrays? Besides bypassing the compiler type checks, I am a bit worried about the ability to extend the iterator through "inheritance in C" like I did with the neighborhood iterator, but maybe I should just try it.

I think the code would benefit from smaller functions, too - 500+ line functions are just too much IMO; they should be split up.

To get a deeper understanding of the code, I am starting to implement several benchmarks to compare the old and new iterators - do you already have some of them handy?

Thanks for the hard work, that's a really nice piece of code,

David
On Tue, Jan 4, 2011 at 4:34 AM, David Cournapeau <cournape@gmail.com> wrote:
Ok, I took some time to look into it, but I am far from understanding everything yet. I will need more time.
Yeah, it ended up being pretty large. I think the UFunc code will shrink substantially when it uses this iterator, which is something I was targeting.

One design issue which bothers me a bit is the dynamically created structure for the iterator - do you have some benchmarks which show that this design is significantly better than a plain old C data structure with a couple of dynamically allocated arrays? Besides bypassing the compiler type checks, I am a bit worried about the ability to extend the iterator through "inheritance in C" like I did with the neighborhood iterator, but maybe I should just try it.

I know what you mean - if I could use C++ templates, the implementation could probably have the best of both worlds, but seeing as NumPy is in C, I tried to compromise mostly towards higher performance. I don't have benchmarks showing that the implementation is faster, but I did validate that the compiler does the optimizations I want it to do. For example, the specialized iternext function for 1 operand and 1 dimension, a common case because of dimension coalescing, looks like this on my machine::

     0:  48 83 47 58 01          addq   $0x1,0x58(%rdi)
     5:  48 8b 47 60             mov    0x60(%rdi),%rax
     9:  48 01 47 68             add    %rax,0x68(%rdi)
     d:  48 8b 47 50             mov    0x50(%rdi),%rax
    11:  48 39 47 58             cmp    %rax,0x58(%rdi)
    15:  0f 9c c0                setl   %al
    18:  0f b6 c0                movzbl %al,%eax
    1b:  c3                      retq

The function has no branches, and all memory accesses are directly offset from the iter pointer %rdi, something I think is pretty good. If this data was in separately allocated arrays, I think it would hurt locality as well as add some more instructions. In the implementation, I tried to structure the data access macros so errors are easy to spot. Accessing the bufferdata and the axisdata isn't typed, but I can think of ways to do that. I was viewing the implementation as fully opaque to any non-iterator code, even within NumPy - do you think such access will be necessary?

I think the code would benefit from smaller functions, too - 500+ line functions are just too much IMO; they should be split up.

I definitely agree; I've been splitting things up as they got large, but that's not finished. I also think the main iterator .c file is too large and needs splitting up.

To get a deeper understanding of the code, I am starting to implement several benchmarks to compare the old and new iterators - do you already have some of them handy?

So far I've just done timing with the Python exposure; C-based benchmarking is welcome. Where possible, NPY_ITER_NO_INNER_ITERATION should be used, since it exposes the possibility of longer inner loops with no function calls. An example where this is not possible is when coordinates are required. I should probably put together a collection of copy/paste templates for typical use.

Thanks for the hard work, that's a really nice piece of code,

Thanks for taking the time to look into it,
Mark
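For reference, here is a minimal sketch of one such copy/paste template, written against the Python exposure of the iterator as used in the ``luf`` example above (``np.newiter``, the 'buffered' and 'no_inner_iteration' flags, and chunk-valued ``it[i]`` are from this thread; the ``chunked_total`` helper itself is only an illustration)::

    import numpy as np

    def chunked_total(a):
        # With 'no_inner_iteration', it[0] is a whole 1-d buffered chunk,
        # so the Python-level overhead is paid once per chunk rather than
        # once per element, and the inner loop runs in C.
        it = np.newiter((a,), ['buffered', 'no_inner_iteration'],
                        [['readonly']])
        total = 0.0
        while not it.finished:
            total += np.sum(it[0])
            it.iternext()
        return total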
On 2011/1/4, Mark Wiebe <mwwiebe@gmail.com> wrote:
To get a deeper understanding of the code, I am starting to implement several benchmarks to compare the old and new iterators - do you already have some of them handy?

So far I've just done timing with the Python exposure; C-based benchmarking is welcome. Where possible, NPY_ITER_NO_INNER_ITERATION should be used, since it exposes the possibility of longer inner loops with no function calls. An example where this is not possible is when coordinates are required. I should probably put together a collection of copy/paste templates for typical use.
Sorry for the naive question, but I use the numpy.fromiter() iterator quite a bit in my projects, and I'm curious whether this new iterator would allow numpy.fromiter() to go faster (I mean, in Python space). Any hint?

-- Francesc Alted
On Wed, Jan 5, 2011 at 4:26 AM, Francesc Alted <faltet@pytables.org> wrote:
Sorry for the naive question, but I use the numpy.fromiter() iterator quite a few in my projects. and I'm curious on whether this new iterator would allow numpy.fromiter() to go faster (I mean, in Python space). Any hint?
The new iterator doesn't offer any help to fromiter in general, but if the iterator being given to the function is a new iterator it would be possible to handle it specially and get a big speedup. -Mark
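For what it's worth, a rough sketch of what such special-casing might look like, assuming the Python exposure described in this thread (an iterator built with 'no_inner_iteration' hands back whole 1-d chunks); the ``fromiter_fastpath`` helper is hypothetical, not an existing NumPy API::

    import numpy as np

    def fromiter_fastpath(it, dtype):
        # Pull whole buffered chunks from the new iterator instead of one
        # Python scalar at a time, which is where np.fromiter spends most
        # of its per-element overhead.
        chunks = []
        while not it.finished:
            chunks.append(np.array(it[0], dtype=dtype))
            it.iternext()
        return np.concatenate(chunks)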
On 2011/1/5, Mark Wiebe <mwwiebe@gmail.com> wrote:
On Wed, Jan 5, 2011 at 4:26 AM, Francesc Alted <faltet@pytables.org> wrote:
Sorry for the naive question, but I use the numpy.fromiter() iterator quite a bit in my projects, and I'm curious whether this new iterator would allow numpy.fromiter() to go faster (I mean, in Python space). Any hint?
The new iterator doesn't offer any help to fromiter in general, but if the iterator being given to the function is a new iterator it would be possible to handle it specially and get a big speedup.
Ah, that's what I thought. Thanks for the clarification. -- Francesc Alted
On Tue, Dec 21, 2010 at 5:53 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:
<snip>
Wow, that's a really nice design and write up. Small typo::

    /* Only allow exactly equivalent types */
    NPY_NO_CASTING=0,
    /* Allow casts between equivalent types of different byte orders */
    NPY_EQUIV_CASTING=0,

Chuck
On Tue, Dec 21, 2010 at 10:05 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
<snip> Wow, that's a really nice design and write up. Small typo:
    /* Only allow exactly equivalent types */
    NPY_NO_CASTING=0,
    /* Allow casts between equivalent types of different byte orders */
    NPY_EQUIV_CASTING=0,
Good catch, turns out the test that should have caught it was broken too.
-Mark
On Wednesday 22 December 2010 01:53:55, Mark Wiebe wrote:
<snip>
Wow, really nice work! It would be great if that could make it into NumPy :-)

Regarding your comment on numexpr being faster, I'm not sure (your new_iterator branch does not work for me; it gives me an error like: AttributeError: 'module' object has no attribute 'newiter'), but my guess is that your approach seems actually faster:
a = np.random.random((50,50,50,10))
b = np.random.random((50,50,1,10))
c = np.random.random((50,50,50,1))

timeit 3*a+b-(a/c)
10 loops, best of 3: 67.5 ms per loop

import numexpr as ne
ne.evaluate("3*a+b-(a/c)")
timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 42.8 ms per loop
i.e. numexpr is not able to achieve the 2x speedup mark that you are getting with ``luf`` (using a Core2 @ 3 GHz here). -- Francesc Alted
On Wed, Dec 22, 2010 at 12:21 AM, Francesc Alted <faltet@pytables.org> wrote:

<snip> Wow, really nice work! It would be great if that could make it into NumPy :-) Regarding your comment on numexpr being faster, I'm not sure (your new_iterator branch does not work for me; it gives me an error like: AttributeError: 'module' object has no attribute 'newiter'),
What are you using to build it? So far I've just modified the setup.py scripts, I still need to add it to numscons.
but my guess is that your approach seems actually faster:
<snip>
i.e. numexpr is not able to achieve the 2x speedup mark that you are getting with ``luf`` (using a Core2 @ 3 GHz here).
That's promising! I based my assertion on getting a slower speedup than numexpr does on their front page example. -Mark
On Wednesday 22 December 2010 09:48:03, Mark Wiebe wrote:

On Wed, Dec 22, 2010 at 12:21 AM, Francesc Alted <faltet@pytables.org> wrote:

<snip> Wow, really nice work! It would be great if that could make it into NumPy :-) Regarding your comment on numexpr being faster, I'm not sure (your new_iterator branch does not work for me; it gives me an error like: AttributeError: 'module' object has no attribute 'newiter'),
What are you using to build it? So far I've just modified the setup.py scripts, I still need to add it to numscons.
Well, just the typical "git clone ...; python setup.py install" dance.
i.e. numexpr is not able to achieve the 2x speedup mark that you are getting with ``luf`` (using a Core2 @ 3 GHz here).
That's promising! I based my assertion on getting a slower speedup than numexpr does on their front page example.
I see :-) Well, I'd think that numexpr is not especially efficient when handling broadcasting, so this might be the reason your approach is faster. I suppose that with operands of the same shape, things might look different.

-- Francesc Alted
On Wed, Dec 22, 2010 at 12:54 AM, Francesc Alted <faltet@pytables.org> wrote:

On Wednesday 22 December 2010 09:48:03, Mark Wiebe wrote:

On Wed, Dec 22, 2010 at 12:21 AM, Francesc Alted <faltet@pytables.org> wrote:
<snip> new_iterator branch does not work for me; it gives me an error like: AttributeError: 'module' object has no attribute 'newiter'),
What are you using to build it? So far I've just modified the setup.py scripts, I still need to add it to numscons.
Well, just the typical "git clone ...; python setup.py install" dance.
Can you print out your np.__version__, and try running the tests? If newiter didn't build for some reason, its tests should be throwing a bunch of exceptions.
i.e. numexpr is not able to achieve the 2x speedup mark that you are getting with ``luf`` (using a Core2 @ 3 GHz here).
That's promising! I based my assertion on getting a slower speedup than numexpr does on their front page example.
I see :-) Well, I'd think that numexpr is not especially efficient when handling broadcasting, so this might be the reason your approach is faster. I suppose that with operands of the same shape, things might look different.
I haven't looked at the numexpr code, but I think the ufuncs will need SSE versions to make up part of the remaining difference. -Mark
On Wednesday 22 December 2010 17:25:13, Mark Wiebe wrote:
Can you print out your np.__version__, and try running the tests? If newiter didn't build for some reason, its tests should be throwing a bunch of exceptions.
I'm a bit swamped now. Let's see if I can do that later on.
I see :-) Well, I'd think that numexpr is not especially efficient when handling broadcasting, so this might be the reason your approach is faster. I suppose that with operands of the same shape, things might look different.
I haven't looked at the numexpr code, but I think the ufuncs will need SSE versions to make up part of the remaining difference.
Uh, I doubt that SSE can do a lot for accelerating operations like 3*a+b-(a/c), as this computation is mainly bounded by memory (although threading does certainly help). Numexpr can use SSE only via Intel's VML, which is very good for accelerating the computation of transcendental functions (sin, cos, sqrt, exp, log...). -- Francesc Alted
On Wed, Dec 22, 2010 at 9:07 AM, Francesc Alted <faltet@pytables.org> wrote:
On Wednesday 22 December 2010 17:25:13, Mark Wiebe wrote:
Can you print out your np.__version__, and try running the tests? If newiter didn't build for some reason, its tests should be throwing a bunch of exceptions.
I'm a bit swamped now. Let's see if I can do that later on.
Ok.
I see :-) Well, I'd think that numexpr is not especially efficient when handling broadcasting, so this might be the reason your approach is faster. I suppose that with operands of the same shape, things might look different.
I haven't looked at the numexpr code, but I think the ufuncs will need SSE versions to make up part of the remaining difference.
Uh, I doubt that SSE can do a lot for accelerating operations like 3*a+b-(a/c), as this computation is mainly bounded by memory (although threading does certainly help). Numexpr can use SSE only via Intel's VML, which is very good for accelerating the computation of transcendental functions (sin, cos, sqrt, exp, log...).
The reason I think it might help is that 'luf' is calculating the expression on smaller sized arrays, which possibly just got buffered. If the memory allocator for the temporaries keeps giving back the same addresses, all this will be in one of the caches very close to the CPU. Unless this cache is still too slow to feed the SSE instructions, there should be a speed benefit. The ufunc inner loops could also use the SSE prefetch instructions, based on the stride, to give some strong hints about where the next memory bytes to use will be.

-Mark
On Wednesday 22 December 2010 18:21:28, Mark Wiebe wrote:

On Wed, Dec 22, 2010 at 9:07 AM, Francesc Alted <faltet@pytables.org> wrote:

On Wednesday 22 December 2010 17:25:13, Mark Wiebe wrote:
Can you print out your np.__version__, and try running the tests? If newiter didn't build for some reason, its tests should be throwing a bunch of exceptions.
$ PYTHONPATH=numpy python -c "import numpy; numpy.test()"
Running unit tests for numpy
NumPy version 2.0.0.dev-147f817
NumPy is installed in /tmp/numpy/numpy
Python version 2.6.1 (r261:67515, Feb 3 2009, 17:34:37) [GCC 4.3.2 [gcc-4_3-branch revision 141291]]
nose version 0.11.0
[clip]
Warning: divide by zero encountered in log
Warning: divide by zero encountered in log
[clip]
Ran 3094 tests in 16.771s
OK (KNOWNFAIL=4, SKIP=1)

IPython seems to work well too:

np.__version__
'2.0.0.dev-147f817'
timeit 3*a+b-(a/c)
10 loops, best of 3: 67.5 ms per loop

However, when trying your luf function:

cpaste
[the luf code here]
--
timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
[clip]
AttributeError: 'module' object has no attribute 'newiter'
<snip>
Ah, okay. However, Numexpr is not meant to accelerate calculations with small operands. I suppose that this is where your new iterator makes more sense: accelerating operations where some of the operands are small (i.e. fit in cache) and have to be broadcasted to match the dimensionality of the others. -- Francesc Alted
On Wed, Dec 22, 2010 at 10:41 AM, Francesc Alted <faltet@pytables.org> wrote:
NumPy version 2.0.0.dev-147f817
There's your problem, it looks like the PYTHONPATH isn't seeing your new build for some reason. That build is off of this commit in the NumPy master branch: https://github.com/numpy/numpy/commit/147f817eefd5efa56fa26b03953a51d533cc27...
<snip>
Ah, okay. However, Numexpr is not meant to accelerate calculations with small operands. I suppose that this is where your new iterator makes more sense: accelerating operations where some of the operands are small (i.e. fit in cache) and have to be broadcasted to match the dimensionality of the others.
It's not about small operands, but small chunks of the operands at a time, with temporary arrays for intermediate calculations. It's the small chunks + temporaries which must fit in cache to get the benefit, not the whole array. The numexpr front page explains this fairly well in the section "Why It Works": http://code.google.com/p/numexpr/#Why_It_Works -Mark
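To make the chunking argument concrete in plain NumPy terms (no numexpr, no new iterator), here is a minimal sketch for the same-shape, contiguous case; ``chunked_eval`` and its ``blocksize`` default are illustrative only::

    import numpy as np

    def chunked_eval(a, b, c, blocksize=8192):
        # Evaluate 3*a+b-(a/c) block by block: each temporary is at most
        # blocksize elements long, so the temporaries stay cache-resident
        # instead of being full-size arrays bounced off main memory.
        out = np.empty_like(a)
        af, bf, cf, of = (x.reshape(-1) for x in (a, b, c, out))
        for start in range(0, af.size, blocksize):
            s = slice(start, start + blocksize)
            of[s] = 3*af[s] + bf[s] - af[s]/cf[s]
        return out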
On Wednesday 22 December 2010 19:52:45, Mark Wiebe wrote:

On Wed, Dec 22, 2010 at 10:41 AM, Francesc Alted <faltet@pytables.org> wrote:
NumPy version 2.0.0.dev-147f817
There's your problem, it looks like the PYTHONPATH isn't seeing your new build for some reason. That build is off of this commit in the NumPy master branch:
https://github.com/numpy/numpy/commit/147f817eefd5efa56fa26b03953a51d533cc27ec
Uh, I think I'm a bit lost here. I've cloned this repo:

$ git clone git://github.com/m-paradox/numpy.git

Is that wrong?
Ah, okay. However, Numexpr is not meant to accelerate calculations with small operands. I suppose that this is where your new iterator makes more sense: accelerating operations where some of the operands are small (i.e. fit in cache) and have to be broadcasted to match the dimensionality of the others.
It's not about small operands, but small chunks of the operands at a time, with temporary arrays for intermediate calculations. It's the small chunks + temporaries which must fit in cache to get the benefit, not the whole array.
But you need to transport those small chunks from main memory to cache before you can start doing the computation for this piece, right? This is why I'm saying that the bottleneck for evaluating arbitrary expressions (like "3*a+b-(a/c)", i.e. not including transcendental functions, nor broadcasting) is memory bandwidth (and in particular RAM bandwidth).
The numexpr front page explains this fairly well in the section "Why It Works":
I know. I wrote that part (based on the notes by David Cooke, the original author ;-) -- Francesc Alted
On Wed, Dec 22, 2010 at 11:16 AM, Francesc Alted <faltet@pytables.org> wrote:

On Wednesday 22 December 2010 19:52:45, Mark Wiebe wrote:

On Wed, Dec 22, 2010 at 10:41 AM, Francesc Alted <faltet@pytables.org> wrote:
NumPy version 2.0.0.dev-147f817
There's your problem, it looks like the PYTHONPATH isn't seeing your new build for some reason. That build is off of this commit in the NumPy master branch:
https://github.com/numpy/numpy/commit/147f817eefd5efa56fa26b03953a51d533cc27ec
Uh, I think I'm a bit lost here. I've cloned this repo:
$ git clone git://github.com/m-paradox/numpy.git
Is that wrong?
That's right, it was my mistake to assume that the page for a branch on github would give you that branch. You need the 'new_iterator' branch, so after that clone, you should do this:

$ git checkout origin/new_iterator
Ah, okay. However, Numexpr is not meant to accelerate calculations with small operands. I suppose that this is where your new iterator makes more sense: accelerating operations where some of the operands are small (i.e. fit in cache) and have to be broadcasted to match the dimensionality of the others.
It's not about small operands, but small chunks of the operands at a time, with temporary arrays for intermediate calculations. It's the small chunks + temporaries which must fit in cache to get the benefit, not the whole array.
But you need to transport those small chunks from main memory to cache before you can start doing the computation for this piece, right? This is why I'm saying that the bottleneck for evaluating arbitrary expressions (like "3*a+b-(a/c)", i.e. not including transcendental functions, nor broadcasting) is memory bandwidth (and in particular RAM bandwidth).
In the example expression, I believe the evaluation would go something like this. Assuming the memory allocator keeps giving back the same locations to 'luf', all temporary variables will already be in cache after the first chunk::

    temp1 = 3 * a           # a is read from main memory
    temp2 = temp1 + b       # b is read from main memory
    temp3 = a / c           # a is already in cache, c is read from main memory
    result = temp2 - temp3  # result is written back to main memory

So there are 4 reads and writes to chunks from outside of the cache, but 12 total reads and writes to chunks, so speeding up the parts already in cache would appear to be beneficial. The benefit will get better with more complicated expressions. I think as long as the operation is slower than a memcpy, the RAM bandwidth isn't the main bottleneck to be concerned with, but instead produces an upper bound on performance. I'm not sure how to precisely measure that overhead, though.
The numexpr front page explains this fairly well in the section "Why It Works":
I know. I wrote that part (based on the notes by David Cooke, the original author ;-)
Cool :) -Mark
On Wednesday 22 December 2010 20:42:54, Mark Wiebe wrote:

On Wed, Dec 22, 2010 at 11:16 AM, Francesc Alted <faltet@pytables.org> wrote:

On Wednesday 22 December 2010 19:52:45, Mark Wiebe wrote:

On Wed, Dec 22, 2010 at 10:41 AM, Francesc Alted <faltet@pytables.org> wrote:
NumPy version 2.0.0.dev-147f817
There's your problem, it looks like the PYTHONPATH isn't seeing your new build for some reason. That build is off of this commit in the NumPy master branch:
https://github.com/numpy/numpy/commit/147f817eefd5efa56fa26b03953a51d533cc27ec
Uh, I think I'm a bit lost here. I've cloned this repo:
$ git clone git://github.com/m-paradox/numpy.git
Is that wrong?
That's right, it was my mistake to assume that the page for a branch on github would give you that branch. You need the 'new_iterator' branch, so after that clone, you should do this:
$ git checkout origin/new_iterator
Ah, things go well now:
timeit 3*a+b-(a/c)
10 loops, best of 3: 67.7 ms per loop
timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 27.8 ms per loop
timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 42.8 ms per loop
So, yup, I'm seeing the good speedup here too :-)
<snip>
Well, see the timings for the non-broadcasting case:
a = np.random.random((50,50,50,10))
b = np.random.random((50,50,50,10))
c = np.random.random((50,50,50,10))

timeit 3*a+b-(a/c)
10 loops, best of 3: 31.1 ms per loop
timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 24.5 ms per loop
timeit ne.evaluate("3*a+b-(a/c)")
100 loops, best of 3: 10.4 ms per loop
However, the above comparison is not fair, as numexpr uses all your cores by default (2 for the case above). If we force using only one core:
ne.set_num_threads(1)
timeit ne.evaluate("3*a+b-(a/c)")
100 loops, best of 3: 16 ms per loop
which is still faster than luf. In this case numexpr was not using SSE, so even if luf does use SSE, that by itself does not imply better speed.

-- Francesc Alted
On Wed, Dec 22, 2010 at 12:05 PM, Francesc Alted <faltet@pytables.org> wrote:
<snip>
Ah, things go well now:
<snip>
So, yup, I'm seeing the good speedup here too :-)
Great! <snip>
Well, see the timings for the non-broadcasting case:
<snip>
Ok, I get pretty close to the same ratios (and my machine feels a bit slow...):

In [6]: timeit 3*a+b-(a/c)
10 loops, best of 3: 101 ms per loop

In [7]: timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 53.4 ms per loop

In [8]: timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 27.8 ms per loop

In [9]: ne.set_num_threads(1)

In [10]: timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 33.6 ms per loop

I think the closest to a "memcpy" we can do here would be just adding, which shows the expression evaluation can be estimated to have 20% overhead. While that's small compared to the speedup over straight NumPy, I think it's still worth considering.

In [11]: timeit ne.evaluate("a+b+c")
10 loops, best of 3: 27.9 ms per loop

Even just switching from add to divide gives more than 10% overhead. With SSE2 these divides could be done two at a time for doubles, or four at a time for floats, to cut that down.

In [12]: timeit ne.evaluate("a/b/c")
10 loops, best of 3: 31.7 ms per loop

This all shows that the 'luf' Python interpreter overhead is still pretty big; the new iterator can't defeat numexpr by itself. I think numexpr could get a nice boost from using the new iterator internally, though - if I go back to the original motivation, different memory orderings, 'luf' is 10x faster than single-threaded numexpr.

In [15]: a = np.random.random((50,50,50,10)).T

In [16]: b = np.random.random((50,50,50,10)).T

In [17]: c = np.random.random((50,50,50,10)).T

In [18]: timeit ne.evaluate("3*a+b-(a/c)")
1 loops, best of 3: 556 ms per loop

In [19]: timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 52.5 ms per loop

Cheers,
Mark
On Wed, Dec 22, 2010 at 12:42 PM, Mark Wiebe <mwwiebe@gmail.com> wrote: <snip>
I think numexpr could get a nice boost from using the new iterator internally though
There's actually a trivial way to do this with very minimal changes to numexpr - the 'itview' mechanism. Create the new iterator, call NpyIter_GetIterView(it,i) (or it.itviews in Python) to get compatibly reordered views of the inputs, then continue with the existing code. -Mark
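A minimal sketch of that route, assuming the Python-level names mentioned in this thread (``np.newiter`` with an ``order`` keyword, and the ``itviews`` attribute); the ``evaluate_reordered`` wrapper and its handling of the result layout are hypothetical::

    import numpy as np
    import numexpr as ne

    def evaluate_reordered(expr, **arrays):
        # Let the new iterator choose a cache-friendly memory ordering,
        # then hand the compatibly reordered views to numexpr unchanged.
        names = list(arrays)
        it = np.newiter([arrays[n] for n in names], [],
                        [['readonly']]*len(names), order='K')
        views = dict(zip(names, it.itviews))
        # Note: the result comes back in iteration order; a real
        # integration would map it back to the operands' original layout.
        return ne.evaluate(expr, local_dict=views)

    # usage sketch: evaluate_reordered("3*a+b-(a/c)", a=a, b=b, c=c)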
On Wednesday 22 December 2010 22:02:03, Mark Wiebe wrote:
On Wed, Dec 22, 2010 at 12:42 PM, Mark Wiebe <mwwiebe@gmail.com> wrote: <snip>
I think numexpr could get a nice boost from using the new iterator internally though
There's actually a trivial way to do this with very minimal changes to numexpr - the 'itview' mechanism. Create the new iterator, call NpyIter_GetIterView(it,i) (or it.itviews in Python) to get compatibly reordered views of the inputs, then continue with the existing code.
That's interesting. I'll think about this (patches are very welcome too!). Thanks! -- Francesc Alted
This is very cool! I would like to see this get into NumPy 2.0. Thanks for all the great work! -Travis On Dec 21, 2010, at 6:53 PM, Mark Wiebe wrote:
<snip>
---
Travis Oliphant
Enthought, Inc.
oliphant@enthought.com
1-512-536-1057
http://www.enthought.com
participants (7)
- Charles R Harris
- David
- David Cournapeau
- Francesc Alted
- John Salvatier
- Mark Wiebe
- Travis Oliphant