[Numpy-discussion] experiments with SSE vectorization

Christopher Jordan-Squire cjordan1 at uw.edu
Fri May 17 00:18:55 EDT 2013


I'd been under the impression that the easiest way to get SSE support
was to have numpy use an optimized blas/lapack. Is that not the case?

On Thu, May 16, 2013 at 10:42 AM, Julian Taylor
<jtaylor.debian at googlemail.com> wrote:
> Hi,
> I have been experimenting a bit with how applicable SSE vectorization is
> to NumPy.
> In principle the core of NumPy mostly deals with memory-bound
> operations, but it turns out that on modern machines with large caches
> you can still get decent speedups.
>
> The experiments are available on this fork:
> https://github.com/juliantaylor/numpy/tree/simd-experiments
> It includes a simple benchmark, 'npbench.py', at the top level.
> No runtime detection is used; vectorization is only enabled on amd64
> systems (which always have SSE2).
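>
> A minimal sketch of that compile-time gate (the macro name here is
> illustrative, not the exact one used in the branch):
>
>     #if defined(__x86_64__) || defined(_M_X64)
>     /* every amd64 CPU implements SSE2, so no cpuid check is needed */
>     #define HAVE_SSE2 1
>     #include <emmintrin.h>  /* SSE2 intrinsics */
>     #endif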
>
> The simd-experiments branch vectorizes sqrt, the basic math operations
> and the min/max reductions.
> For float32 operations you get speedups of around 2x (simple ops) to
> 4x (sqrt).
> For float64 it is around 1.2x-2x, depending on the CPU.
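>
> For illustration, the elementwise case looks roughly like this
> (a simplified sketch only, not the actual code in the branch):
>
>     #include <emmintrin.h>
>     #include <stddef.h>
>
>     /* out[i] = a[i] + b[i], four floats per SSE2 iteration */
>     static void add_f32(float *out, const float *a, const float *b,
>                         size_t n)
>     {
>         size_t i = 0;
>         for (; i + 4 <= n; i += 4) {
>             __m128 va = _mm_loadu_ps(a + i);  /* unaligned load */
>             __m128 vb = _mm_loadu_ps(b + i);
>             _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
>         }
>         for (; i < n; i++)  /* scalar tail for the remainder */
>             out[i] = a[i] + b[i];
>     }
>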
> My Phenom(tm) II X4 955 retains a good speedup even for very large
> data sizes, but on Intel CPUs (Xeon and Core2 Duo) you don't gain
> anything if the data is larger than the L3 cache.
> The vectorized version was never slower on the Phenom and the Xeon,
> but on the Core2 Duo plain addition with very large data sets got 10%
> slower. This could be compensated for by using aligned load
> operations, but that is not implemented yet; the idea is sketched
> below.
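>
> The usual trick is to peel off scalar iterations until one of the
> pointers is 16-byte aligned and then use aligned loads for it. Again
> only a sketch of the idea, reusing the declarations from the sketch
> above (needs <stdint.h> for uintptr_t):
>
>     /* peel until `a` is 16-byte aligned */
>     size_t i = 0;
>     while (i < n && ((uintptr_t)(a + i) & 15) != 0) {
>         out[i] = a[i] + b[i];
>         i++;
>     }
>     for (; i + 4 <= n; i += 4) {
>         __m128 va = _mm_load_ps(a + i);   /* aligned load */
>         __m128 vb = _mm_loadu_ps(b + i);  /* b may still be unaligned */
>         _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
>     }
>     for (; i < n; i++)  /* scalar tail */
>         out[i] = a[i] + b[i];
>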
> I'm interested in the results of the npbench.py command on other CPUs,
> so if you want to try it, please send me the output (and include
> /proc/cpuinfo).
>
> The code is a little rough; it can probably be cleaned up a bit by
> adapting the code generator that is used.
> Would this be something worth including in NumPy?
>
> Further vectorization targets on my todo list are things like
> std/var/mean, basically anything that has a high computation-to-memory
> ratio. Suggestions are welcome; a sketch of such a reduction follows.
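>
> The sum at the core of a mean, for example, can keep four partial sums
> in a register and combine them at the end. A sketch only, not code
> from the branch, reusing the includes from the earlier sketch (note
> that it changes the floating-point summation order):
>
>     /* sum of n floats; mean(d) is sum_f32(d, n) / n */
>     static float sum_f32(const float *a, size_t n)
>     {
>         __m128 vsum = _mm_setzero_ps();  /* four partial sums */
>         size_t i = 0;
>         for (; i + 4 <= n; i += 4)
>             vsum = _mm_add_ps(vsum, _mm_loadu_ps(a + i));
>         float partial[4];
>         _mm_storeu_ps(partial, vsum);    /* spill and combine lanes */
>         float s = partial[0] + partial[1] + partial[2] + partial[3];
>         for (; i < n; i++)               /* scalar tail */
>             s += a[i];
>         return s;
>     }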
>
>
> Here are the detailed results for my Phenom:
> float32 datasize (2MB)
> operation:                         speedup
> np.float32 np.max(d)                 3.04
> np.float32 np.min(d)                  3.1
> np.float32 np.sum(d)                 3.02
> np.float32 np.prod(d)                3.04
> np.float32 np.add(1, d)              1.44
> np.float32 np.add(d, 1)              1.45
> np.float32 np.divide(1, d)           3.41
> np.float32 np.divide(d, 1)           3.41
> np.float32 np.divide(d, d)           3.42
> np.float32 np.add(d, d)              1.42
> np.float32 np.multiply(d, d)         1.43
> np.float32 np.sqrt(d)                4.26
>
> float64 datasize (4MB)
> operation:                         speedup
> np.float64 np.max(d)                    2
> np.float64 np.min(d)                 1.89
> np.float64 np.sum(d)                 1.62
> np.float64 np.prod(d)                1.63
> np.float64 np.add(1, d)              1.08
> np.float64 np.add(d, 1)             0.993
> np.float64 np.divide(1, d)           1.83
> np.float64 np.divide(d, 1)           1.74
> np.float64 np.divide(d, d)            1.8
> np.float64 np.add(d, d)              1.02
> np.float64 np.multiply(d, d)         1.05
> np.float64 np.sqrt(d)                2.22
>
> Attached are the results for the Intel CPUs.
>


