Hello NumPy-ers, After some performance analysis, I've designed and implemented a new iterator designed to speed up ufuncs and allow for easier multi-dimensional iteration. The new code is fairly large, but works quite well already. If some people could read the NEP and give some feedback, that would be great! Here's a link: https://github.com/m-paradox/numpy/blob/mw_neps/doc/neps/new-iterator-ufunc.... I would also love it if someone could try building the code and play around with it a bit. The github branch is here: https://github.com/m-paradox/numpy/tree/new_iterator To give a taste of the iterator's functionality, below is an example from the NEP for how to implement a "Lambda UFunc." With just a few lines of code, it's possible to replicate something similar to the numexpr library (numexpr still gets a bigger speedup, though). In the example expression I chose, execution time went from 138ms to 61ms. Hopefully this is a good Christmas present for NumPy. :) Cheers, Mark Here is the definition of the ``luf`` function.:: def luf(lamdaexpr, *args, **kwargs): """Lambda UFunc e.g. c = luf(lambda i,j:i+j, a, b, order='K', casting='safe', buffersize=8192) c = np.empty(...) luf(lambda i,j:i+j, a, b, out=c, order='K', casting='safe', buffersize=8192) """ nargs = len(args) op = args + (kwargs.get('out',None),) it = np.newiter(op, ['buffered','no_inner_iteration'], [['readonly','nbo_aligned']]*nargs + [['writeonly','allocate','no_broadcast']], order=kwargs.get('order','K'), casting=kwargs.get('casting','safe'), buffersize=kwargs.get('buffersize',0)) while not it.finished: it[-1] = lamdaexpr(*it[:-1]) it.iternext() return it.operands[-1] Then, by using ``luf`` instead of straight Python expressions, we can gain some performance from better cache behavior.:: In [2]: a = np.random.random((50,50,50,10)) In [3]: b = np.random.random((50,50,1,10)) In [4]: c = np.random.random((50,50,50,1)) In [5]: timeit 3*a+b-(a/c) 1 loops, best of 3: 138 ms per loop In [6]: timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c) 10 loops, best of 3: 60.9 ms per loop In [7]: np.all(3*a+b-(a/c) == luf(lambda a,b,c:3*a+b-(a/c), a, b, c)) Out[7]: True