[Numpy-discussion] numexpr with the new iterator
Mark Wiebe
mwwiebe at gmail.com
Sun Jan 9 17:45:02 EST 2011
To benchmark C-based usage of the new iterator, and to get the iterator
working properly in a multi-threaded context, I've updated numexpr to use
it. In addition to some performance improvements, this also made it easy
to add optional out= and order= parameters to the evaluate function. The
numexpr repository with this update is available here:
https://github.com/m-paradox/numexpr
To use it, you need the new_iterator branch of NumPy from here:
https://github.com/m-paradox/numpy
In all cases tested, the iterator version of numexpr's evaluate function
matches or beats the standard version. The timing results are below, with
some explanatory comments placed inline:
-Mark
In [1]: import numpy as np; import numexpr as ne
# numexpr front page example
In [2]: a = np.arange(1e6)
In [3]: b = np.arange(1e6)
In [4]: timeit a**2 + b**2 + 2*a*b
1 loops, best of 3: 121 ms per loop
In [5]: ne.set_num_threads(1)
# iterator version performance matches standard version
In [6]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
10 loops, best of 3: 24.8 ms per loop
In [7]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
10 loops, best of 3: 24.3 ms per loop
In [8]: ne.set_num_threads(2)
# iterator version performance matches standard version
In [9]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
10 loops, best of 3: 21 ms per loop
In [10]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
10 loops, best of 3: 20.5 ms per loop
# numexpr front page example with a 10x bigger array
In [11]: a = np.arange(1e7)
In [12]: b = np.arange(1e7)
In [13]: ne.set_num_threads(2)
# the iterator version performance improvement is due to
# a small task scheduler tweak
In [14]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 282 ms per loop
In [15]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 255 ms per loop
# numexpr front page example with a Fortran contiguous array
In [16]: a = np.arange(1e7).reshape(10,100,100,100).T
In [17]: b = np.arange(1e7).reshape(10,100,100,100).T
In [18]: timeit a**2 + b**2 + 2*a*b
1 loops, best of 3: 3.22 s per loop
In [19]: ne.set_num_threads(1)
# even with a C-ordered output, the iterator version performs better
In [20]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 3.74 s per loop
In [21]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 379 ms per loop
In [22]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b", order='C')
1 loops, best of 3: 2.03 s per loop
In [23]: ne.set_num_threads(2)
# the standard version just uses 1 thread here, I believe
# the iterator version performs the same as for the flat 1e7-sized array
In [24]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 3.92 s per loop
In [25]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 254 ms per loop
In [26]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b", order='C')
1 loops, best of 3: 1.74 s per loop
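
For readers who want to see what the order= and out= parameters mean without
building the branches above, here is a minimal sketch that uses only NumPy's
Python-level np.nditer. It is not numexpr's implementation. The function name
square_sum and the small test arrays are made up for illustration. It
evaluates the same expression chunk by chunk and lets the iterator allocate
the output, so with order='K' the result follows the memory layout of the
inputs, and with order='C' it is forced to be C-ordered:

```python
import numpy as np

def square_sum(a, b, out=None, order="K"):
    """Evaluate a**2 + b**2 + 2*a*b with np.nditer (illustrative helper)."""
    it = np.nditer(
        [a, b, out],
        flags=["external_loop"],  # hand back 1-D chunks instead of scalars
        op_flags=[["readonly"], ["readonly"], ["writeonly", "allocate"]],
        order=order,              # 'K' keeps the inputs' memory order
    )
    with it:
        for x, y, z in it:
            z[...] = x**2 + y**2 + 2 * x * y
        # If out was None, this is the array the iterator allocated.
        return it.operands[2]

# Small Fortran-contiguous inputs, built the same way as in the timings.
a = np.arange(24.0).reshape(2, 3, 4).T
b = np.arange(24.0).reshape(2, 3, 4).T

res_k = square_sum(a, b)             # output layout follows the inputs
res_c = square_sum(a, b, order="C")  # output forced to C order
print(res_k.flags.f_contiguous, res_c.flags.c_contiguous)
```

With order='C' the iterator has to walk the Fortran-contiguous inputs against
their memory layout, which is the same kind of access pattern that makes the
order='C' timings above slower than the default.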