On Wed, Dec 22, 2010 at 12:05 PM, Francesc Alted <faltet@pytables.org> wrote:
<snip>
Ah, things go well now:
timeit 3*a+b-(a/c)
10 loops, best of 3: 67.7 ms per loop

timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 27.8 ms per loop

timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 42.8 ms per loop
So, yup, I'm seeing the good speedup here too :-)
Great!

<snip>
Well, see the timings for the non-broadcasting case:
a = np.random.random((50,50,50,10))
b = np.random.random((50,50,50,10))
c = np.random.random((50,50,50,10))
timeit 3*a+b-(a/c)
10 loops, best of 3: 31.1 ms per loop

timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 24.5 ms per loop

timeit ne.evaluate("3*a+b-(a/c)")
100 loops, best of 3: 10.4 ms per loop
However, the above comparison is not fair, as numexpr uses all your cores by default (two in the case above). If we force it to use only one core:
ne.set_num_threads(1)
timeit ne.evaluate("3*a+b-(a/c)")
100 loops, best of 3: 16 ms per loop
which is still faster than luf. In this case numexpr was not using SSE; even if luf does use SSE, that alone does not guarantee better speed.
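For readers without IPython, the %timeit comparisons above can be reproduced with the standard timeit module. A minimal sketch, using plain NumPy only (numexpr and luf would be timed the same way):

```python
import timeit

import numpy as np

# Same operand shapes as in the timings above.
a = np.random.random((50, 50, 50, 10))
b = np.random.random((50, 50, 50, 10))
c = np.random.random((50, 50, 50, 10))

# Average seconds per evaluation of the plain-NumPy expression;
# IPython's %timeit does the same, with automatic loop/repeat selection.
n = 10
secs = timeit.timeit(lambda: 3*a + b - (a/c), number=n) / n
print(f"{secs * 1e3:.1f} ms per loop")
```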
Ok, I get pretty close to the same ratios (and my machine feels a bit slow...):

In [6]: timeit 3*a+b-(a/c)
10 loops, best of 3: 101 ms per loop

In [7]: timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 53.4 ms per loop

In [8]: timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 27.8 ms per loop

In [9]: ne.set_num_threads(1)

In [10]: timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 33.6 ms per loop

I think the closest to a "memcpy" we can get here would be just adding, which suggests the expression evaluation carries roughly 20% overhead. While that's small compared to the speedup over straight NumPy, I think it's still worth considering.

In [11]: timeit ne.evaluate("a+b+c")
10 loops, best of 3: 27.9 ms per loop

Even just switching from add to divide costs more than 10% extra. With SSE2 these divides could be done two at a time for doubles, or four at a time for floats, to cut that down.

In [12]: timeit ne.evaluate("a/b/c")
10 loops, best of 3: 31.7 ms per loop

This all shows that the Python interpreter overhead in 'luf' is still pretty big; the new iterator can't beat numexpr by itself. I think numexpr could get a nice boost from using the new iterator internally, though: going back to the original motivation, different memory orderings, 'luf' is 10x faster than single-threaded numexpr.

In [15]: a = np.random.random((50,50,50,10)).T
In [16]: b = np.random.random((50,50,50,10)).T
In [17]: c = np.random.random((50,50,50,10)).T

In [18]: timeit ne.evaluate("3*a+b-(a/c)")
1 loops, best of 3: 556 ms per loop

In [19]: timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 52.5 ms per loop

Cheers,
Mark
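P.S. For anyone following along, here is a sketch of what a 'luf'-style helper can look like, built on the buffered nditer. This is a hypothetical reconstruction modeled on the recipe in the NumPy nditer documentation; the actual 'luf' from this thread may differ in its options and defaults. The key point is order='K': the iterator visits elements in memory order, so transposed inputs are still read contiguously, which is where the 10x win over single-threaded numexpr comes from.

```python
import numpy as np

def luf(expr, *args):
    # Hypothetical 'luf' sketch: evaluate an elementwise lambda over all
    # operands using the buffered iterator, chunk by chunk.
    nargs = len(args)
    it = np.nditer(
        (None,) + args,                # first operand is the output, allocated for us
        flags=['buffered', 'external_loop'],
        op_flags=[['writeonly', 'allocate']] + [['readonly']] * nargs,
        order='K')                     # iterate in memory order, not C order
    with it:                           # context manager flushes buffered writes
        while not it.finished:
            it[0] = expr(*it[1:])      # apply the lambda to one contiguous chunk
            it.iternext()
        return it.operands[0]
```

Because the chunks handed to the lambda are sizable 1-d buffers, the per-chunk Python call overhead is amortized, unlike an element-at-a-time loop.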