experiments with SSE vectorization
Hi,

I have been experimenting a bit with how applicable SSE vectorization is to NumPy. In principle the core of NumPy mostly deals with memory-bound operations, but it turns out that on modern machines with large caches you can still get decent speedups.

The experiments are available on this fork: https://github.com/juliantaylor/numpy/tree/simd-experiments
It includes a simple benchmark, 'npbench.py', in the top level. No runtime detection is used; the vectorized code is only enabled on amd64 systems (which always have SSE2).

The simd-experiments branch vectorizes sqrt, the basic math operations and the min/max reductions. For float32 operations you get speedups of around 2 (simple ops) to 4 (sqrt); for double it is around 1.2 to 2, depending on the CPU. My Phenom(tm) II X4 955 retains a good speedup even for very large data sizes, but on the Intel CPUs I tried (a Xeon and a Core2Duo) you don't gain anything once the data is larger than the L3 cache. The vectorized version was never slower on the Phenom and the Xeon, but on the Core2Duo the plain addition with very large datasets got 10% slower. This can be compensated by using aligned load operations, but that is not implemented yet.

I'm interested in your results of npbench.py on other CPUs, so if you want to try it please send me the output (and include your /proc/cpuinfo).

The code is a little rough; it can probably be cleaned up a bit by adapting the code generator used. Would this be something worth including in NumPy?

Further vectorization targets on my todo list are things like std/var/mean, basically anything that has a high computation/memory ratio. Suggestions are welcome.

Here are the detailed results for my Phenom:

float32, datasize 2MB (operation: speedup):

    np.max(d)           3.04
    np.min(d)           3.1
    np.sum(d)           3.02
    np.prod(d)          3.04
    np.add(1, d)        1.44
    np.add(d, 1)        1.45
    np.divide(1, d)     3.41
    np.divide(d, 1)     3.41
    np.divide(d, d)     3.42
    np.add(d, d)        1.42
    np.multiply(d, d)   1.43
    np.sqrt(d)          4.26

float64, datasize 4MB (operation: speedup):

    np.max(d)           2
    np.min(d)           1.89
    np.sum(d)           1.62
    np.prod(d)          1.63
    np.add(1, d)        1.08
    np.add(d, 1)        0.993
    np.divide(1, d)     1.83
    np.divide(d, 1)     1.74
    np.divide(d, d)     1.8
    np.add(d, d)        1.02
    np.multiply(d, d)   1.05
    np.sqrt(d)          2.22

Attached are the results for the Intel CPUs.
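To give an idea of the kind of inner loop this is about, here is a minimal sketch of an SSE2 loop for contiguous float32 addition. This is an illustration only, not the code from the simd-experiments branch (the actual loops are produced by the code generator and also handle strides, the scalar/array argument variants and the reductions); the function name is made up for the example.

/*
 * Illustration only -- not the code from the simd-experiments branch.
 * Compile with -msse2; on amd64 SSE2 is always available.
 */
#include <emmintrin.h>  /* SSE/SSE2 intrinsics */
#include <stddef.h>

/* out[i] = a[i] + b[i] for contiguous float32 arrays */
static void add_float32_sse2(float *out, const float *a, const float *b,
                             size_t n)
{
    size_t i = 0;

    /* main loop: 4 single precision floats per 128 bit register */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* unaligned loads; using aligned */
        __m128 vb = _mm_loadu_ps(b + i);  /* loads where possible would fix */
                                          /* the Core2Duo slowdown          */
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }

    /* scalar tail for the remaining 0-3 elements */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}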
I'd been under the impression that the easiest way to get SSE support was to have numpy use an optimized blas/lapack. Is that not the case?
On 17 May 2013 05:19, "Christopher Jordan-Squire" <cjordan1@uw.edu> wrote:
I'd been under the impression that the easiest way to get SSE support was to have numpy use an optimized blas/lapack. Is that not the case?
Apples and oranges. That's the easiest (only) way to get SSE support for operations that go through blas/lapack, but there are also lots of operations in numpy that are implemented directly. -n
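To make the apples-and-oranges point concrete: a call like np.dot on float64 arrays is forwarded to the BLAS, so an optimized blas/lapack speeds it up, while an elementwise ufunc like np.add runs an inner loop that lives in NumPy's own C code and gets no benefit from the BLAS. A rough sketch, using simplified hypothetical code rather than NumPy's actual source:

/*
 * Simplified illustration, not NumPy's actual implementation.
 * Linking against an optimized BLAS (ATLAS, MKL, OpenBLAS, ...) only
 * accelerates the first function; the second is the kind of loop that
 * lives inside NumPy itself and is what the SSE work above targets.
 */
#include <cblas.h>
#include <stddef.h>

/* roughly what np.dot(a, b) does for 1-d float64 arrays: call the BLAS */
double dot_via_blas(const double *a, const double *b, int n)
{
    return cblas_ddot(n, a, 1, b, 1);
}

/* roughly what the np.add ufunc does: NumPy's own C inner loop */
void add_inner_loop(double *out, const double *a, const double *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}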
participants (3)
- Christopher Jordan-Squire
- Julian Taylor
- Nathaniel Smith