On Wed, Dec 22, 2010 at 9:07 AM, Francesc Alted <faltet@pytables.org> wrote:
On Wednesday 22 December 2010 17:25:13 Mark Wiebe wrote:
Can you print out your np.__version__, and try running the tests? If newiter didn't build for some reason, its tests should be throwing a bunch of exceptions.
I'm a bit swamped now. Let's see if I can do that later on.
Ok.
I see :-) Well, I'd think that numexpr is not especially efficient when handling broadcasting, so this might be the reason your approach is faster. I suppose that with operands of the same shape, things might look different.
I haven't looked at the numexpr code, but I think the ufuncs will need SSE versions to make up part of the remaining difference.
Uh, I doubt that SSE can do a lot for accelerating operations like 3*a+b-(a/c), as this computation is mainly memory-bound (although threading certainly does help). Numexpr can use SSE only via Intel's VML, which is very good for accelerating the computation of transcendental functions (sin, cos, sqrt, exp, log...).
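To put a rough number on the memory-bound argument, here is a back-of-the-envelope sketch. It assumes a naive evaluation in which every binary operation is a separate pass over float64 arrays, each pass writes a full-size temporary, and nothing stays in cache between passes. The helper name `naive_traffic` is hypothetical and is not part of NumPy or numexpr.

```python
def naive_traffic(n_elements, itemsize=8):
    """Estimate bytes moved and flops for 3*a + b - (a/c), one pass per op.

    Assumption: each pass reads its array operands from main memory and
    writes a full-size result; the scalar 3 costs no memory traffic.
    """
    # (arrays read, arrays written) for each pass
    passes = [
        (1, 1),  # t1  = 3 * a
        (2, 1),  # t2  = t1 + b
        (2, 1),  # t3  = a / c
        (2, 1),  # out = t2 - t3
    ]
    arrays_moved = sum(r + w for r, w in passes)
    bytes_moved = arrays_moved * n_elements * itemsize
    flops = len(passes) * n_elements
    return bytes_moved, flops


bytes_moved, flops = naive_traffic(1_000_000)
# Bytes of memory traffic per arithmetic operation under this model
print(bytes_moved / flops)
```

Under these assumptions the expression moves many bytes for every arithmetic operation it performs, which is why faster arithmetic alone would not be expected to help much when the operands do not fit in cache.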
The reason I think it might help is that with 'luf' it's calculating the expression on smaller-sized arrays, which possibly just got buffered. If the memory allocator for the temporaries keeps giving back the same addresses, all this will be in one of the caches very close to the CPU. Unless this cache is still too slow to feed the SSE instructions, there should be a speed benefit. The ufunc inner loops could also use the SSE prefetch instructions based on the stride to give some strong hints about where the next bytes to be read will be.
-Mark
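The idea of evaluating the expression on small pieces whose temporaries keep the same addresses can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the 'luf' implementation or numexpr's virtual machine: the function name `blocked_eval` and the block size are hypothetical, and the operands are assumed to be 1-D floating-point arrays of equal length.

```python
import numpy as np


def blocked_eval(a, b, c, block=4096):
    """Evaluate 3*a + b - (a/c) block by block for same-shape 1-D arrays."""
    out = np.empty_like(a)
    # Two scratch buffers allocated once and reused for every block, so the
    # temporaries always live at the same addresses.
    t1 = np.empty(block, dtype=a.dtype)
    t2 = np.empty(block, dtype=a.dtype)
    for start in range(0, a.shape[0], block):
        stop = min(start + block, a.shape[0])
        n = stop - start  # the last block may be shorter
        np.multiply(a[start:stop], 3, out=t1[:n])            # t1 = 3*a
        np.add(t1[:n], b[start:stop], out=t1[:n])            # t1 = 3*a + b
        np.divide(a[start:stop], c[start:stop], out=t2[:n])  # t2 = a/c
        np.subtract(t1[:n], t2[:n], out=out[start:stop])     # out = t1 - t2
    return out


rng = np.random.default_rng(0)
a, b, c = (rng.random(10_000) + 1.0 for _ in range(3))
assert np.allclose(blocked_eval(a, b, c), 3 * a + b - (a / c))
```

Whether this is actually faster depends on the block size, the cache sizes, and the per-call overhead of the ufuncs, so it would need to be timed on the machine in question.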