[Numpy-discussion] Performance problems with strided arrays in NumPy

Wed Apr 19 14:49:02 EDT 2006

On Tue, Apr 18, 2006 at 09:01:54PM -0600, Travis Oliphant wrote:
> faltet at xot.carabos.com wrote:
> The source of this slowness is the use in numarray of  special-cases for 
> certain-sized byte-copies.
> 
> Apparently,  it is *much* faster to do
> 
> ((double *)dst)[0] = ((double *)src)[0]
> 
> when you have aligned data than it is to do
> 
> memmove(dst, src, sizeof(double))

Mmm.. very interesting.

> My timings for your benchmark with current SVN of NumPy are:
> 
> NumPy: [0.021701812744140625, 0.021739959716796875, 0.021548032760620117]
> Numarray: [0.052516937255859375, 0.052685976028442383, 0.052355051040649414]

Well, on my machine, using the numpy SVN version:

numpy: [0.0974161624908447, 0.0621590614318847, 0.0612149238586425]
numarray: [0.0658359527587890, 0.0623040199279785, 0.0627131462097167]

So, numpy and numarray show the same performance now (it's curious that
you are actually getting better performance on your platform). However:

In [25]: stnac=timeit.Timer('b=a.copy()','import numarray as np; a=np.arange(1000000,dtype="complex128")[::10]')

In [26]: stnpc=timeit.Timer('b=a.copy()','import numpy as np; a=np.arange(1000000,dtype="complex128")[::10]')

In [27]: stnac.repeat(3,10)
Out[27]: [0.11303496360778809, 0.11540508270263672, 0.11556506156921387]

In [28]: stnpc.repeat(3,10)
Out[28]: [0.21353006362915039, 0.21468400955200195, 0.21390914916992188]
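
(For anyone who wants to repeat the numpy half of this measurement without
numarray installed, here is a rough standalone sketch. The helper name
time_strided_copy and its default arguments are my own choices for
illustration, mirroring the array size, step, and repeat counts used above;
absolute timings will of course differ by machine and numpy version.)

```python
import timeit

import numpy as np


def time_strided_copy(dtype, n=1_000_000, step=10, repeat=3, number=10):
    """Return `repeat` timings (in seconds), each for `number` copies.

    Hypothetical helper: it copies a non-contiguous view that takes every
    `step`-th element of an array of `n` items with the given dtype.
    """
    a = np.arange(n, dtype=dtype)[::step]  # strided, non-contiguous view
    timer = timeit.Timer(a.copy)           # a callable is a valid statement
    return timer.repeat(repeat, number)


if __name__ == "__main__":
    # Compare an 8-byte element type with the 16-byte complex type.
    for dtype in ("float64", "complex128"):
        print(dtype, time_strided_copy(dtype))
```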

So, it seems that you forgot to optimize the complex types. Fortunately,
the cure is easy; after applying the attached patch I'm getting:

In [3]: stnpc.repeat(3,10)
Out[3]: [0.10468602180480957, 0.10204982757568359, 0.10242295265197754]

so, with the patch, numpy copies strided complex128 arrays about twice as
fast as before, on par with numarray.
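
The idea behind the patch is simply that a 16-byte complex128 element is two
adjacent 8-byte doubles, so it can be copied as two typed assignments. This
can be illustrated from Python as a sketch (it is not how numpy does the copy
internally, and the variable names and sizes are made up for the example):

```python
import numpy as np

n, step = 1000, 10

# A contiguous complex128 array with distinct real and imaginary parts.
base = np.arange(n, dtype=np.float64) + 1j * np.arange(n, 0, -1, dtype=np.float64)
strided = base[::step]  # every `step`-th element; stride is 16 * step bytes

# Reinterpret the same memory as rows of two float64 values:
# column 0 holds the real part, column 1 the imaginary part.
pairs = base.view(np.float64).reshape(-1, 2)[::step]

# Copy each element as two doubles into a fresh contiguous buffer.
out = np.empty(strided.shape, dtype=np.complex128)
out_pairs = out.view(np.float64).reshape(-1, 2)  # view onto `out`, not a copy
out_pairs[:, 0] = pairs[:, 0]
out_pairs[:, 1] = pairs[:, 1]

# The pairwise copy matches an ordinary strided copy.
assert np.array_equal(out, strided.copy())
```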

Thanks for looking into this!

Francesc

======================================================================

--- numpy/core/src/arrayobject.c        (revision 2381)
+++ numpy/core/src/arrayobject.c        (working copy)
@@ -629,6 +629,14 @@
         char *tout = dst;
         char *tin = src;
         switch(elsize) {
+        case 16:
+                for (i=0; i<N; i++) {
+                        ((Float64 *)tout)[0] = ((Float64 *)tin)[0];
+                        ((Float64 *)tout)[1] = ((Float64 *)tin)[1];
+                        tin = tin + instrides;
+                        tout = tout + outstrides;
+                }
+                return;
         case 8:
                 for (i=0; i<N; i++) {
                         ((Float64 *)tout)[0] = ((Float64 *)tin)[0];