Re: [Numpy-discussion] numpy ufuncs and COREPY - any info?

May 22, 2009


On Friday 22 May 2009 11:42:56 Gregor Thalhammer wrote:
...
dmitrey wrote:
...
hi all,
has anyone already tried comparing an ordinary numpy ufunc with the
corresponding one from CorePy? First of all I mean the project
http://socghop.appspot.com/student_project/show/google/gsoc2009/python/t124024628235
It would be interesting to know what the speedup is for (e.g.) vec ** 0.5
or (if it's possible, since it isn't a pure ufunc) numpy.dot(Matrix, vec),
or any other example.
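
A rough way to get baseline numbers for plain numpy on your own machine is
sketched below. This is only a sketch: the helper `best_time` and the size
`n` are made up for illustration, nothing here touches CorePy, and the
timings will vary a lot with hardware and with how numpy was built.

```python
import timeit

import numpy as np


def best_time(func, repeat=3, number=5):
    """Best per-call wall time in seconds (hypothetical helper)."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


n = 1000  # arbitrary size, chosen only for illustration
rng = np.random.default_rng(0)
vec = rng.random(n)
matrix = rng.random((n, n))

# The two operations from the question, plus np.sqrt as a point of comparison.
timings = {
    "vec ** 0.5": best_time(lambda: vec ** 0.5),
    "np.sqrt(vec)": best_time(lambda: np.sqrt(vec)),
    "np.dot(matrix, vec)": best_time(lambda: np.dot(matrix, vec)),
}

for name, seconds in timings.items():
    print(f"{name:22s} {seconds * 1e6:10.1f} us per call")
```

The same harness could be pointed at any alternative implementation, as
long as it is fed the same arrays.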

I have no experience with the mentioned CorePy, but recently I was
playing around with accelerated ufuncs using Intel's Math Kernel Library
(MKL). These improvements are now part of the numexpr package:
http://code.google.com/p/numexpr/
Some remarks on possible speed improvements on recent Intel x86 processors:
1) basic arithmetic ufuncs (add, sub, mul, ...) in standard numpy are
fast (SSE is used) and speed is limited by memory bandwidth.
2) the speed of many transcendental functions (exp, sin, cos, pow, ...)
can be improved by _roughly_ a factor of five (single core) by using the
MKL. Most of the improvements stem from using faster algorithms with a
vectorized implementation. Note: the speed improvement depends on a
_lot_ of other circumstances.
3) Improving performance by using multiple cores is much more difficult.
A significant speedup is possible only for sufficiently large (>1e5)
arrays. Where a speed gain is possible, the MKL uses several cores.
Some experimentation showed that by adding a few OpenMP constructs you
could get a similar speedup with numpy.
4) numpy.dot uses optimized implementations.
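
Points 1) and 2) can be checked roughly by comparing the per-element cost
of a basic arithmetic ufunc with that of a transcendental one. The sketch
below uses plain numpy only; `ns_per_element` and the two array sizes are
illustrative assumptions, and the outcome depends on the CPU, the cache
sizes and the numpy build.

```python
import timeit

import numpy as np


def ns_per_element(func, size, repeat=3, number=5):
    """Best time per array element in nanoseconds (hypothetical helper)."""
    best = min(timeit.repeat(func, repeat=repeat, number=number)) / number
    return best / size * 1e9


results = {}
for size in (10_000, 1_000_000):  # one small and one large array
    a = np.random.default_rng(0).random(size)
    b = np.random.default_rng(1).random(size)
    out = np.empty_like(a)  # reuse one output buffer to avoid allocations
    results[size] = {
        "add": ns_per_element(lambda: np.add(a, b, out=out), size),
        "exp": ns_per_element(lambda: np.exp(a, out=out), size),
    }

for size, row in results.items():
    print(f"size={size:>9d}  add: {row['add']:6.2f} ns/elem"
          f"  exp: {row['exp']:6.2f} ns/elem")
```

If add is limited by memory bandwidth and exp by computation, exp should
come out noticeably more expensive per element, which is where a faster
vectorized implementation has room to help.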

Good points, Gregor.  However, I wouldn't say that improving performance by
using multiple cores is *that* difficult, but rather that multiple cores can
only be used efficiently *when* memory bandwidth is not the limitation.  An
example of this is the computation of transcendental functions, where, even
with vectorized implementations, the computation speed is still CPU-bound
in many cases.  And you have seen very good speed-ups yourself for
these cases with your implementation of numexpr/MKL :)
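
For anyone who wants to try this, here is a minimal sketch of evaluating a
transcendental expression through numexpr and timing it with different
thread counts. It assumes a present-day numexpr that provides
`evaluate` and `set_num_threads`; the function `exp_sin`, the expression
and the array size are made up for illustration, and the code falls back to
plain numpy when numexpr is not installed.

```python
import timeit

import numpy as np

try:
    import numexpr as ne  # optional dependency, may not be installed
except ImportError:
    ne = None


def exp_sin(a):
    """Evaluate exp(a) * sin(a), through numexpr when it is available."""
    if ne is None:
        return np.exp(a) * np.sin(a)
    return ne.evaluate("exp(a) * sin(a)", local_dict={"a": a})


a = np.linspace(0.0, 1.0, 1_000_000)
reference = np.exp(a) * np.sin(a)
result = exp_sin(a)
print("max abs difference:", np.max(np.abs(result - reference)))

if ne is not None:
    for threads in (1, 2):
        previous = ne.set_num_threads(threads)  # returns the old setting
        seconds = min(timeit.repeat(lambda: exp_sin(a), repeat=3, number=5)) / 5
        ne.set_num_threads(previous)  # restore the original thread count
        print(f"{threads} thread(s): {seconds * 1e3:.2f} ms per call")
```

Whether the second thread helps depends on the machine and on whether the
expression is really CPU-bound at this array size.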

Cheers,

-- 
Francesc Alted