[Numpy-discussion] NEP for faster ufuncs
mwwiebe at gmail.com
Wed Dec 22 12:21:28 EST 2010
On Wed, Dec 22, 2010 at 9:07 AM, Francesc Alted <faltet at pytables.org> wrote:
> On Wednesday 22 December 2010 17:25:13 Mark Wiebe wrote:
> > Can you print out your np.__version__, and try running the tests? If
> > newiter didn't build for some reason, its tests should be throwing a
> > bunch of exceptions.
> I'm a bit swamped now. Let's see if I can do that later on.
> > > I see :-) Well, I'd think that numexpr is not especially efficient
> > > when handling broadcasting, so this might be the reason your
> > > approach is faster. I suppose that with operands with the same
> > > shape, things might look different.
> > I haven't looked at the numexpr code, but I think the ufuncs will
> > need SSE versions to make up part of the remaining difference.
> Uh, I doubt that SSE can do a lot for accelerating operations like
> 3*a+b-(a/c), as this computation is mainly bounded by memory (although
> threading does certainly help). Numexpr can use SSE only via Intel's
> VML, which is very good for accelerating the computation of
> transcendental functions (sin, cos, sqrt, exp, log...).
The reason I think it might help is that with 'luf' it's calculating
the expression on smaller sized arrays, which possibly just got buffered.
If the memory allocator for the temporaries keeps giving back the same
addresses, all this will be in one of the caches very close to the CPU.
Unless this cache is still too slow to feed the SSE instructions, there
should be a speed benefit. The ufunc inner loops could also use the SSE
prefetch instructions based on the stride to give some strong hints about
where the next memory bytes to use will be.
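
To make the buffering idea concrete, here is a rough sketch of evaluating
3*a+b-(a/c) one block at a time while reusing the same two small
temporaries. This is not the 'luf' or newiter code: the function name
blocked_eval and the block size of 4096 are made up for illustration, and
it assumes 1-D floating point arrays of equal length.

```python
import numpy as np

def blocked_eval(a, b, c, block=4096):
    """Evaluate 3*a + b - (a/c) block by block (illustrative sketch only)."""
    out = np.empty_like(a)
    # Two small temporaries, allocated once and reused for every block,
    # so they keep the same addresses and can stay cache-resident.
    t1 = np.empty(block, dtype=a.dtype)
    t2 = np.empty(block, dtype=a.dtype)
    for start in range(0, a.size, block):
        stop = min(start + block, a.size)
        n = stop - start
        x, y = t1[:n], t2[:n]                           # views, no new allocation
        np.multiply(a[start:stop], 3, out=x)            # x = 3*a
        np.add(x, b[start:stop], out=x)                 # x = 3*a + b
        np.divide(a[start:stop], c[start:stop], out=y)  # y = a/c
        np.subtract(x, y, out=out[start:stop])          # out = x - y
    return out
```

Whether this beats the plain expression depends on the block size relative
to the cache sizes and on the per-call overhead of the ufuncs, so it would
need to be timed rather than assumed.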