[Numpy-discussion] experiments with SSE vectorization

Julian Taylor jtaylor.debian at googlemail.com
Thu May 16 13:42:13 EDT 2013

I have been experimenting a bit with how applicable SSE vectorization is
to NumPy.
In principle the core of NumPy mostly deals with memory-bound
operations, but it turns out that on modern machines with large caches
you can still get decent speedups.

The experiments are available on this fork:
It includes a simple benchmark 'npbench.py' in the top level.
No runtime detection is used; it is only enabled on amd64 systems (which
always have SSE2).
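
For anyone unsure whether the vectorized paths would be compiled in on
their machine, here is a rough sketch of a check from the Python side.
It is only an illustration: the branch decides at compile time, and the
function name and the set of machine strings below are assumptions, not
something taken from the fork.

```python
import platform
import sys

# Machine strings commonly reported for amd64 (assumed list, compared in
# lower case).
AMD64_NAMES = {"x86_64", "amd64"}


def is_amd64_build(machine=None, is_64bit=None):
    """Guess whether this interpreter is a 64-bit build on an amd64 CPU.

    Hypothetical helper: it inspects the running interpreter and does not
    look at how NumPy itself was compiled.
    """
    if machine is None:
        machine = platform.machine()
    if is_64bit is None:
        # A 64-bit interpreter has a maximum index larger than 2**32.
        is_64bit = sys.maxsize > 2**32
    return machine.lower() in AMD64_NAMES and bool(is_64bit)
```

Calling is_amd64_build() with no arguments inspects the running
interpreter; the arguments exist only so the logic can be tried with
other values.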

The simd-experiments branch vectorizes sqrt, the basic math operations
and the min/max reductions.
For float32 operations you get speedups of around 2 (simple ops) to 4 (sqrt).
For double it is around 1.2 to 2, depending on the CPU.
My Phenom(tm) II X4 955 retains a good speedup even for very large
data sizes, but on Intel CPUs (Xeon and Core 2 Duo) you don't gain anything
if the data is larger than the L3 cache.
The vectorized version was never slower on the Phenom and the Xeon,
but on a Core 2 Duo plain addition with very large datasets got 10%
slower. This can be compensated for by using aligned load operations, but
that is not implemented yet.
I'm interested in the npbench.py results on other CPUs, so
if you want to try it please send me the output (and include /proc/cpuinfo).
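
To give an idea of the kind of measurement involved, here is a minimal
timing sketch. It is not npbench.py itself: the function names, the
repeat counts and the input data are assumptions made for illustration.
A speedup only comes out of it when the same script is run once against
a stock NumPy build and once against the experimental one, and the two
sets of timings are divided.

```python
import timeit

import numpy as np


def make_data(dtype, nbytes):
    """Return a 1-d array of the given dtype occupying nbytes bytes."""
    n = nbytes // np.dtype(dtype).itemsize
    rng = np.random.RandomState(0)
    # Shift away from zero so the divisions stay finite.
    return (rng.rand(n) + 0.5).astype(dtype)


def time_ops(dtype, nbytes, number=10, repeat=3):
    """Best-of-`repeat` time per call, in seconds, for each operation."""
    d = make_data(dtype, nbytes)
    ops = {
        "np.max(d)": lambda: np.max(d),
        "np.min(d)": lambda: np.min(d),
        "np.sum(d)": lambda: np.sum(d),
        "np.prod(d)": lambda: np.prod(d),
        "np.add(1, d)": lambda: np.add(1, d),
        "np.add(d, 1)": lambda: np.add(d, 1),
        "np.divide(1, d)": lambda: np.divide(1, d),
        "np.divide(d, 1)": lambda: np.divide(d, 1),
        "np.divide(d, d)": lambda: np.divide(d, d),
        "np.add(d, d)": lambda: np.add(d, d),
        "np.multiply(d, d)": lambda: np.multiply(d, d),
        "np.sqrt(d)": lambda: np.sqrt(d),
    }
    results = {}
    # Silence floating point warnings (e.g. underflow in prod) while timing.
    with np.errstate(all="ignore"):
        for name, fn in ops.items():
            best = min(timeit.repeat(fn, number=number, repeat=repeat))
            results[name] = best / number
    return results


def speedups(baseline, candidate):
    """Ratio baseline/candidate per operation; above 1 means faster."""
    return {name: baseline[name] / candidate[name] for name in baseline}
```

For example, time_ops(np.float32, 2 * 1024 * 1024) would be run under
each build, the two dictionaries saved, and then passed to speedups().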

The code is a little rough; it can probably be cleaned up a bit by
adapting the code generator used.
Would this be something worth including in NumPy?

Further vectorization targets on my todo list are things like
std/var/mean, basically anything that has a high computation/memory
ratio. Suggestions are welcome.

Here are the detailed results for my Phenom:
float32 datasize (2MB)
operation:                         speedup
np.float32 np.max(d)                 3.04
np.float32 np.min(d)                  3.1
np.float32 np.sum(d)                 3.02
np.float32 np.prod(d)                3.04
np.float32 np.add(1, d)              1.44
np.float32 np.add(d, 1)              1.45
np.float32 np.divide(1, d)           3.41
np.float32 np.divide(d, 1)           3.41
np.float32 np.divide(d, d)           3.42
np.float32 np.add(d, d)              1.42
np.float32 np.multiply(d, d)         1.43
np.float32 np.sqrt(d)                4.26

float64 datasize (4MB)
operation:                         speedup
np.float64 np.max(d)                    2
np.float64 np.min(d)                 1.89
np.float64 np.sum(d)                 1.62
np.float64 np.prod(d)                1.63
np.float64 np.add(1, d)              1.08
np.float64 np.add(d, 1)             0.993
np.float64 np.divide(1, d)           1.83
np.float64 np.divide(d, 1)           1.74
np.float64 np.divide(d, d)            1.8
np.float64 np.add(d, d)              1.02
np.float64 np.multiply(d, d)         1.05
np.float64 np.sqrt(d)                2.22
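
As a rough yardstick for reading the tables above: an SSE register is
128 bits wide, so one packed instruction handles 4 float32 or 2 float64
values. The sketch below works out that lane count, the element count
behind each data size, and how a few of the measured speedups compare
to the lane count. It assumes 1MB = 2**20 bytes, and the helper names
are made up for illustration. The lane count is not a hard limit;
measured values can land above it for reasons the lane count alone does
not capture.

```python
SSE_REGISTER_BITS = 128  # width of one XMM register

ITEM_BITS = {"float32": 32, "float64": 64}

# A few speedups copied from the Phenom II tables above.
MEASURED = {
    ("float32", "np.sqrt(d)"): 4.26,
    ("float32", "np.add(d, d)"): 1.42,
    ("float64", "np.sqrt(d)"): 2.22,
    ("float64", "np.add(d, d)"): 1.02,
}


def lanes(dtype):
    """Number of values of this dtype that fit in one SSE register."""
    return SSE_REGISTER_BITS // ITEM_BITS[dtype]


def elements(nbytes, dtype):
    """Number of array elements in a buffer of nbytes bytes."""
    return nbytes * 8 // ITEM_BITS[dtype]


def fraction_of_lane_count(dtype, op):
    """Measured speedup divided by the lane count for that dtype."""
    return MEASURED[(dtype, op)] / lanes(dtype)


MB = 2**20  # assumption: the data sizes in the tables are binary megabytes
n_float32 = elements(2 * MB, "float32")
n_float64 = elements(4 * MB, "float64")
```

Under that assumption both tables cover the same number of elements, so
the float32 and float64 rows differ in bytes moved, not in element
count.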

Attached are the results for the Intel CPUs.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: results.tar.gz
Type: application/gzip
Size: 8668 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20130516/038b759a/attachment.bin>