[Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

Fri Jan 16 09:39:57 EST 2009

Francesc Alted schrieb:
> Numexpr is a fast numerical expression evaluator for NumPy.  With it,
> expressions that operate on arrays (like "3*a+4*b") are accelerated
> and use less memory than doing the same calculation in Python.
>
> The expected speed-ups for Numexpr respect to NumPy are between 0.95x
> and 15x, being 3x or 4x typical values.  The strided and unaligned
> case has been optimized too, so if the expresion contains such arrays, 
> the speed-up can increase significantly.  Of course, you will need to 
> operate with large arrays (typically larger than the cache size of your 
> CPU) to see these improvements in performance.
>
>   
Just recently I had a more detailed look at numexpr. Clever idea, easy 
to use! I can affirm an typical performance gain of 3x if you work on 
large arrays (>100k entries).

I also gave a try to the vector math library (VML), contained in Intel's 
Math Kernel Library. This offers a fast implementation of mathematical 
functions, operating on array. First I implemented a C extension, 
providing new ufuncs. This gave me a big performance gain, e.g., 2.3x 
(5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x (6x) for 
division (no gain for add, sub, mul). The values in parantheses are 
given if I allow VML to use several threads and to employ both cores of 
my Intel Core2Duo computer. For large arrays (100M entries) this 
performance gain is reduced because of limited memory bandwidth. At this 
point I was stumbling across numexpr and modified it to use the VML 
functions. For sufficiently long and complex  numerical expressions I 
could  get the maximum performance also for large arrays.  Together with 
VML numexpr seems to be a extremely powerful to get an optimum 
performance. I would like to see numexpr extended to (optionally) make 
use of fast vectorized math functions. There is one but: VML supports 
(at the moment) only math on contiguous arrays. At a first try I didn't 
understand how to enforce this limitation in numexpr. I also gave a 
quick try to the equivalent vector math library, acml_mv of AMD. I only 
tried sin and log, gave me the same performance (on a Intel processor!) 
like Intels VML .

I was also playing around with the block size in numexpr. What are the 
rationale that led to the current block size of 128? Especially with 
VML, a larger block size of 4096 instead of 128 allowed  to efficiently 
use multithreading in VML.
> Share your experience
> =====================
>
> Let us know of any bugs, suggestions, gripes, kudos, etc. you may
> have.
>
>   
I was missing the support for single precision floats.

Great work!

Gregor