experiments with SSE vectorization
Hi,

I have been experimenting a bit with how applicable SSE vectorization is to NumPy. In principle the core of NumPy mostly deals with memory-bound operations, but it turns out that on modern machines with large caches you can still get decent speedups.

The experiments are available on this fork: https://github.com/juliantaylor/numpy/tree/simd-experiments
It includes a simple benchmark, 'npbench.py', in the top level. No runtime detection is used; the vectorized code is only enabled on amd64 systems (which always have SSE2).

The simd-experiments branch vectorizes sqrt, the basic math operations and the min/max reductions. For float32 operations you get speedups of around 2 (simple ops) to 4 (sqrt); for double it is around 1.2 to 2, depending on the CPU. My Phenom(tm) II X4 955 retains a good speedup even for very large data sizes, but on the Intel CPUs I tried (a Xeon and a Core2Duo) you don't gain anything once the data is larger than the L3 cache. The vectorized version was never slower on the Phenom and the Xeon, but on the Core2Duo the plain addition with very large datasets got 10% slower. This can be compensated by using aligned load operations, but that is not implemented yet.

I'm interested in your results of npbench.py on other CPUs, so if you want to try it please send me the output (and include your /proc/cpuinfo).

The code is a little rough; it can probably be cleaned up a bit by adapting the code generator used. Would this be something worth including in NumPy?

Further vectorization targets on my todo list are things like std/var/mean, basically anything that has a high computation/memory ratio. Suggestions are welcome.

Here are the detailed results for my Phenom:

float32, datasize 2MB (operation: speedup):

    np.max(d)           3.04
    np.min(d)           3.1
    np.sum(d)           3.02
    np.prod(d)          3.04
    np.add(1, d)        1.44
    np.add(d, 1)        1.45
    np.divide(1, d)     3.41
    np.divide(d, 1)     3.41
    np.divide(d, d)     3.42
    np.add(d, d)        1.42
    np.multiply(d, d)   1.43
    np.sqrt(d)          4.26

float64, datasize 4MB (operation: speedup):

    np.max(d)           2
    np.min(d)           1.89
    np.sum(d)           1.62
    np.prod(d)          1.63
    np.add(1, d)        1.08
    np.add(d, 1)        0.993
    np.divide(1, d)     1.83
    np.divide(d, 1)     1.74
    np.divide(d, d)     1.8
    np.add(d, d)        1.02
    np.multiply(d, d)   1.05
    np.sqrt(d)          2.22

Attached are the results for the Intel CPUs.
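To give an idea of the kind of inner loop this is about, here is a minimal sketch of an SSE2 loop for contiguous float32 addition. This is an illustration only, not the code from the simd-experiments branch (the actual loops are produced by the code generator and also handle strides, the scalar/array argument variants and the reductions); the function name is made up for the example.

/*
 * Illustration only -- not the code from the simd-experiments branch.
 * Compile with -msse2; on amd64 SSE2 is always available.
 */
#include <emmintrin.h>  /* SSE/SSE2 intrinsics */
#include <stddef.h>

/* out[i] = a[i] + b[i] for contiguous float32 arrays */
static void add_float32_sse2(float *out, const float *a, const float *b,
                             size_t n)
{
    size_t i = 0;

    /* main loop: 4 single precision floats per 128 bit register */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* unaligned loads; using aligned */
        __m128 vb = _mm_loadu_ps(b + i);  /* loads where possible would fix */
                                          /* the Core2Duo slowdown          */
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }

    /* scalar tail for the remaining 0-3 elements */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}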
I'd been under the impression that the easiest way to get SSE support was to have numpy use an optimized blas/lapack. Is that not the case?
On 17 May 2013 05:19, "Christopher Jordan-Squire" <cjordan1@uw.edu> wrote:
I'd been under the impression that the easiest way to get SSE support was to have numpy use an optimized blas/lapack. Is that not the case?
Apples and oranges. That's the easiest (only) way to get SSE support for operations that go through blas/lapack, but there are also lots of operations in numpy that are implemented directly. -n
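To make the apples-and-oranges point concrete: a call like np.dot on float64 arrays is forwarded to the BLAS, so an optimized blas/lapack speeds it up, while an elementwise ufunc like np.add runs an inner loop that lives in NumPy's own C code and gets no benefit from the BLAS. A rough sketch, using simplified hypothetical code rather than NumPy's actual source:

/*
 * Simplified illustration, not NumPy's actual implementation.
 * Linking against an optimized BLAS (ATLAS, MKL, OpenBLAS, ...) only
 * accelerates the first function; the second is the kind of loop that
 * lives inside NumPy itself and is what the SSE work above targets.
 */
#include <cblas.h>
#include <stddef.h>

/* roughly what np.dot(a, b) does for 1-d float64 arrays: call the BLAS */
double dot_via_blas(const double *a, const double *b, int n)
{
    return cblas_ddot(n, a, 1, b, 1);
}

/* roughly what the np.add ufunc does: NumPy's own C inner loop */
void add_inner_loop(double *out, const double *a, const double *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}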
participants (3)
- Christopher Jordan-Squire
- Julian Taylor
- Nathaniel Smith