On Wed, Dec 22, 2010 at 12:05 PM, Francesc Alted <faltet@pytables.org> wrote:
<snip>
Ah, things go well now:
timeit 3*a+b-(a/c)
10 loops, best of 3: 67.7 ms per loop

timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 27.8 ms per loop

timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 42.8 ms per loop
So, yup, I'm seeing the good speedup here too :-)
Great!

<snip>
Well, see the timings for the non-broadcasting case:
a = np.random.random((50,50,50,10))
b = np.random.random((50,50,50,10))
c = np.random.random((50,50,50,10))
timeit 3*a+b-(a/c)
10 loops, best of 3: 31.1 ms per loop

timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 24.5 ms per loop

timeit ne.evaluate("3*a+b-(a/c)")
100 loops, best of 3: 10.4 ms per loop
However, the above comparison is not fair, as numexpr uses all your cores by default (two in the case above). If we force it to use only one core:
ne.set_num_threads(1)
timeit ne.evaluate("3*a+b-(a/c)")
100 loops, best of 3: 16 ms per loop
which is still faster than luf. In this case numexpr was not using SSE; even if luf does use SSE, that alone does not guarantee better speed.
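For readers without IPython, the %timeit comparisons above can be reproduced with the standard timeit module. A minimal sketch, using plain NumPy only (numexpr and luf would be timed the same way):

```python
import timeit

import numpy as np

# Same operand shapes as in the timings above.
a = np.random.random((50, 50, 50, 10))
b = np.random.random((50, 50, 50, 10))
c = np.random.random((50, 50, 50, 10))

# Average seconds per evaluation of the plain-NumPy expression;
# IPython's %timeit does the same, with automatic loop/repeat selection.
n = 10
secs = timeit.timeit(lambda: 3*a + b - (a/c), number=n) / n
print(f"{secs * 1e3:.1f} ms per loop")
```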
Ok, I get pretty close to the same ratios (and my machine feels a bit slow...):

In [6]: timeit 3*a+b-(a/c)
10 loops, best of 3: 101 ms per loop

In [7]: timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 53.4 ms per loop

In [8]: timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 27.8 ms per loop

In [9]: ne.set_num_threads(1)

In [10]: timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 33.6 ms per loop

I think the closest to a "memcpy" we can get here would be just adding, which suggests the expression evaluation carries roughly 20% overhead. While that's small compared to the speedup over straight NumPy, I think it's still worth considering.

In [11]: timeit ne.evaluate("a+b+c")
10 loops, best of 3: 27.9 ms per loop

Even just switching from add to divide costs more than 10% extra. With SSE2 these divides could be done two at a time for doubles, or four at a time for floats, to cut that down.

In [12]: timeit ne.evaluate("a/b/c")
10 loops, best of 3: 31.7 ms per loop

This all shows that the Python interpreter overhead in 'luf' is still pretty big; the new iterator can't beat numexpr by itself. I think numexpr could get a nice boost from using the new iterator internally, though: going back to the original motivation, different memory orderings, 'luf' is 10x faster than single-threaded numexpr.

In [15]: a = np.random.random((50,50,50,10)).T
In [16]: b = np.random.random((50,50,50,10)).T
In [17]: c = np.random.random((50,50,50,10)).T

In [18]: timeit ne.evaluate("3*a+b-(a/c)")
1 loops, best of 3: 556 ms per loop

In [19]: timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 52.5 ms per loop

Cheers,
Mark
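P.S. For anyone following along, here is a sketch of what a 'luf'-style helper can look like, built on the buffered nditer. This is a hypothetical reconstruction modeled on the recipe in the NumPy nditer documentation; the actual 'luf' from this thread may differ in its options and defaults. The key point is order='K': the iterator visits elements in memory order, so transposed inputs are still read contiguously, which is where the 10x win over single-threaded numexpr comes from.

```python
import numpy as np

def luf(expr, *args):
    # Hypothetical 'luf' sketch: evaluate an elementwise lambda over all
    # operands using the buffered iterator, chunk by chunk.
    nargs = len(args)
    it = np.nditer(
        (None,) + args,                # first operand is the output, allocated for us
        flags=['buffered', 'external_loop'],
        op_flags=[['writeonly', 'allocate']] + [['readonly']] * nargs,
        order='K')                     # iterate in memory order, not C order
    with it:                           # context manager flushes buffered writes
        while not it.finished:
            it[0] = expr(*it[1:])      # apply the lambda to one contiguous chunk
            it.iternext()
        return it.operands[0]
```

Because the chunks handed to the lambda are sizable 1-d buffers, the per-chunk Python call overhead is amortized, unlike an element-at-a-time loop.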