[Numpy-discussion] numexpr with the new iterator

Sun Jan 9 17:45:02 EST 2011

As a benchmark of C-based iterator usage, and to make the iterator work
properly in a multi-threaded context, I've updated numexpr to use the new
iterator.  In addition to some performance improvements, this also made it
easy to add optional out= and order= parameters to the evaluate function.
The numexpr repository with this update is available here:

https://github.com/m-paradox/numexpr

To use it, you need the new_iterator branch of NumPy from here:

https://github.com/m-paradox/numpy
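
To give a feel for what out= and order= mean on top of the iterator, here is
a minimal sketch in pure Python. It is not numexpr's implementation: it
assumes a NumPy release that exposes the iterator as np.nditer (with context
manager support), and evaluate_sketch is a hypothetical name used only for
illustration. The iterator hands out buffered chunks, allocates the output
when none is passed, and picks the output's memory layout from the requested
order.

```python
import numpy as np

def evaluate_sketch(a, b, out=None, order='K'):
    """Hypothetical helper: compute a**2 + b**2 + 2*a*b chunk by chunk."""
    it = np.nditer(
        [a, b, out],
        flags=['external_loop', 'buffered'],
        op_flags=[['readonly'], ['readonly'], ['writeonly', 'allocate']],
        order=order,
    )
    with it:
        # Each x, y, z is a 1-D chunk; z is a writable chunk of the output.
        for x, y, z in it:
            z[...] = x**2 + y**2 + 2*x*y
        # operands[2] is either the caller's out array or the one the
        # iterator allocated to match the iteration order.
        return it.operands[2]

# Small Fortran-contiguous inputs, built the same way as in the timings.
a = np.arange(24.0).reshape(2, 3, 4).T
b = np.arange(24.0).reshape(2, 3, 4).T

res_k = evaluate_sketch(a, b)             # output layout follows the inputs
res_c = evaluate_sketch(a, b, order='C')  # output forced to C order
```

With order='K' the allocated result keeps the inputs' Fortran layout, so the
loop walks memory sequentially. With order='C' the result is C-ordered and
the inputs have to be read through buffers.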

In all cases tested, the iterator version of numexpr's evaluate function
(evaluate_iter in the session below) matches or beats the standard version.
The timing results are below, with some explanatory comments placed inline:

-Mark

# np below is NumPy, assumed to be already imported (import numpy as np)

In [1]: import numexpr as ne

# numexpr front page example

In [2]: a = np.arange(1e6)
In [3]: b = np.arange(1e6)

In [4]: timeit a**2 + b**2 + 2*a*b
1 loops, best of 3: 121 ms per loop

In [5]: ne.set_num_threads(1)

# iterator version performance matches standard version

In [6]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
10 loops, best of 3: 24.8 ms per loop
In [7]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
10 loops, best of 3: 24.3 ms per loop

In [8]: ne.set_num_threads(2)

# iterator version performance matches standard version

In [9]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
10 loops, best of 3: 21 ms per loop
In [10]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
10 loops, best of 3: 20.5 ms per loop

# numexpr front page example with a 10x bigger array

In [11]: a = np.arange(1e7)
In [12]: b = np.arange(1e7)

In [13]: ne.set_num_threads(2)

# the iterator version performance improvement is due to
# a small task scheduler tweak

In [14]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 282 ms per loop
In [15]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 255 ms per loop

# numexpr front page example with a Fortran contiguous array

In [16]: a = np.arange(1e7).reshape(10,100,100,100).T
In [17]: b = np.arange(1e7).reshape(10,100,100,100).T

In [18]: timeit a**2 + b**2 + 2*a*b
1 loops, best of 3: 3.22 s per loop

In [19]: ne.set_num_threads(1)

# even with a C-ordered output, the iterator version performs better

In [20]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 3.74 s per loop
In [21]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 379 ms per loop
In [22]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b", order='C')
1 loops, best of 3: 2.03 s per loop

In [23]: ne.set_num_threads(2)

# I believe the standard version uses just 1 thread here;
# the iterator version performs the same as for the flat 1e7-sized array

In [24]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 3.92 s per loop
In [25]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 254 ms per loop
In [26]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b", order='C')
1 loops, best of 3: 1.74 s per loop
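
For readers unfamiliar with the Fortran-contiguous case above: transposing a
C-ordered array reverses its strides without copying any data, which is what
makes the result Fortran contiguous. The sketch below uses a much smaller
shape than the timings (an assumption, purely so it runs instantly) and
np.nditer, assuming a NumPy release that provides it, to show the difference
between visiting elements in memory order (order='K') and in logical C order
(order='C').

```python
import numpy as np

# Small stand-in for np.arange(1e7).reshape(10, 100, 100, 100).T
a = np.arange(24.0).reshape(2, 3, 4).T

print(a.shape)                  # (4, 3, 2)
print(a.flags['C_CONTIGUOUS'])  # False
print(a.flags['F_CONTIGUOUS'])  # True
print(a.strides)                # (8, 32, 96) for float64

# order='K' follows the memory layout, so values come out sequentially.
k_order = [float(x) for x in np.nditer(a, order='K')]

# order='C' follows the logical row-major order of the transposed view,
# which jumps around in memory.
c_order = [float(x) for x in np.nditer(a, order='C')]

print(k_order[:4])  # [0.0, 1.0, 2.0, 3.0]
print(c_order[:4])  # [0.0, 12.0, 4.0, 16.0]
```

The strided jumps in the C-order traversal are consistent with the slower
order='C' timings above, although this sketch does not measure performance.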