[Numpy-discussion] NEP for faster ufuncs

Wed Dec 22 15:42:43 EST 2010

On Wed, Dec 22, 2010 at 12:05 PM, Francesc Alted <faltet at pytables.org> wrote:

> <snip>
>
> Ah, things go well now:
>
> >>> timeit 3*a+b-(a/c)
> 10 loops, best of 3: 67.7 ms per loop
> >>> timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
> 10 loops, best of 3: 27.8 ms per loop
> >>> timeit ne.evaluate("3*a+b-(a/c)")
> 10 loops, best of 3: 42.8 ms per loop
>
> So, yup, I'm seeing the good speedup here too :-)
>

Great!

<snip>
>
> Well, see the timings for the non-broadcasting case:
>
> >>> a = np.random.random((50,50,50,10))
> >>> b = np.random.random((50,50,50,10))
> >>> c = np.random.random((50,50,50,10))
>
> >>> timeit 3*a+b-(a/c)
> 10 loops, best of 3: 31.1 ms per loop
> >>> timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
> 10 loops, best of 3: 24.5 ms per loop
> >>> timeit ne.evaluate("3*a+b-(a/c)")
> 100 loops, best of 3: 10.4 ms per loop
>
> However, the above comparison is not fair, as numexpr uses all your
> cores by default (2 for the case above).  If we force using only one
> core:
>
> >>> ne.set_num_threads(1)
> >>> timeit ne.evaluate("3*a+b-(a/c)")
> 100 loops, best of 3: 16 ms per loop
>
> which is still faster than luf.  In this case numexpr was not using SSE,
> but in case luf does so, this does not imply better speed.

Ok, I get pretty close to the same ratios (and my machine feels a bit
slow...):

In [6]: timeit 3*a+b-(a/c)
10 loops, best of 3: 101 ms per loop

In [7]: timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 53.4 ms per loop

In [8]: timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 27.8 ms per loop

In [9]: ne.set_num_threads(1)
In [10]: timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 33.6 ms per loop

I think the closest thing to a "memcpy" we can do here is a plain add.
Measured against that, the expression evaluation has roughly 20% overhead
(33.6 ms vs. 27.9 ms).  While that's small compared to the speedup over
straight NumPy, I think it's still worth considering.

In [11]: timeit ne.evaluate("a+b+c")
10 loops, best of 3: 27.9 ms per loop

Even just switching from add to divide gives more than 10% overhead.  With
SSE2 these divides could be done two at a time for doubles or four at a time
for floats to cut that down.

In [12]: timeit ne.evaluate("a/b/c")
10 loops, best of 3: 31.7 ms per loop
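
For reference, the two overhead estimates above can be reproduced from the
quoted single-threaded timings. This is only a sketch of the arithmetic; the
helper name `overhead` is made up for illustration, and the lane counts
simply assume the 128-bit register width of SSE2.

```python
# Timings quoted above, in milliseconds (numexpr with one thread).
t_add = 27.9    # ne.evaluate("a+b+c"), the "memcpy-like" baseline
t_expr = 33.6   # ne.evaluate("3*a+b-(a/c)")
t_div = 31.7    # ne.evaluate("a/b/c")


def overhead(t, baseline):
    """Extra time of t relative to baseline, as a fraction (hypothetical helper)."""
    return (t - baseline) / baseline


expr_overhead = overhead(t_expr, t_add)
div_overhead = overhead(t_div, t_add)
print(f"expression vs. add: {expr_overhead:.1%}")  # 20.4%
print(f"divide vs. add:     {div_overhead:.1%}")   # 13.6%

# SSE2 registers are 128 bits wide, which gives the lane counts
# mentioned above: two doubles or four floats per instruction.
lanes_double = 128 // 64
lanes_float = 128 // 32
```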

This all shows that the 'luf' Python interpreter overhead is still pretty
big; the new iterator can't beat numexpr by itself.  I think numexpr could
get a nice boost from using the new iterator internally, though.  If I go
back to the original motivation, different memory orderings, 'luf' is about
10x faster than single-threaded numexpr:

In [15]: a = np.random.random((50,50,50,10)).T
In [16]: b = np.random.random((50,50,50,10)).T
In [17]: c = np.random.random((50,50,50,10)).T

In [18]: timeit ne.evaluate("3*a+b-(a/c)")
1 loops, best of 3: 556 ms per loop

In [19]: timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 52.5 ms per loop
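
To make the "different memory orderings" point concrete, here is a rough
pure-Python sketch of what `.T` does to the layout of these arrays. The
helper `c_order_strides` is hypothetical, and the item size of 8 bytes
assumes float64 (which is what `np.random.random` returns). It says nothing
about how numexpr iterates internally; it only shows why a loop that assumes
C order would touch memory with a large stride.

```python
def c_order_strides(shape, itemsize=8):
    """Byte strides of a C-contiguous array (hypothetical helper)."""
    strides = []
    step = itemsize
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


shape = (50, 50, 50, 10)
strides = c_order_strides(shape)

# Transposing reverses the axes without copying any data, so both the
# shape and the strides are simply reversed.
t_shape = shape[::-1]
t_strides = strides[::-1]

print(strides)    # (200000, 4000, 80, 8)
print(t_strides)  # (8, 80, 4000, 200000)
```

After the transpose, the last axis has a stride of 200000 bytes, so walking
it as the innermost loop jumps about 200 KB between consecutive elements,
while the first axis is now the contiguous one.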

Cheers,
Mark