[Numpy-discussion] Performance problems with strided arrays in NumPy
Travis Oliphant
oliphant at ee.byu.edu
Wed Apr 19 20:44:02 EDT 2006
faltet at xot.carabos.com wrote:
>On Wed, Apr 19, 2006 at 09:48:14PM +0000, faltet at xot.carabos.com wrote:
>
>
>>On Tue, Apr 18, 2006 at 09:01:54PM -0600, Travis Oliphant wrote:
>>
>>
>>>Apparently, it is *much* faster to do
>>>
>>>((double *)dst)[0] = ((double *)src)[0]
>>>
>>>when you have aligned data than it is to do
>>>
>>>memmove(dst, src, sizeof(double))
>>>
>>>
>>Mmm.. very interesting.
>>
>>
>
>A follow-up on this. After analyzing the issue a bit, it seems that
>the problem with the memcpy() version was not the call itself, but the
>parameter that was passed as the number of bytes to copy. As this
>parameter's value is unknown at compile time, the compiler cannot
>generate optimized code for it and always has to fetch its value from
>memory (or cache).
>
>
>In the version of the code that you optimized, you managed to do this
>because you are telling the compiler (i.e. specifying at compile
>time) the exact extent of the data copy, allowing it to generate
>optimal code for the copy operation. However, if you do a similar
>thing but using the call (using doubles here):
>
>memcpy(tout, tin, 8);
>
>instead of:
>
>((Float64 *)tout)[0] = ((Float64 *)tin)[0];
>
>and repeat the operation for the other types, then you can achieve
>performance similar to the pointer version.
>
>
This is good to know. It certainly makes sense. I'll test it on my
system when I get back.
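The two strided-copy variants being compared can be sketched as follows. This is an illustrative reconstruction, not NumPy's actual code; the function names and signatures are made up for the example. The point is that `sizeof(double)` is a compile-time constant, so the compiler can inline the `memcpy` into the same move instructions as the typed assignment.

```c
#include <string.h>

/* Copy n doubles between strided buffers (strides in bytes).
 * Variant A: typed pointer dereference -- fast, but requires
 * properly aligned data on strict-alignment platforms.
 * Variant B: memcpy with a compile-time-constant size, which the
 * compiler can inline; passing a runtime-variable size instead is
 * what caused the slowdown discussed above. */

static void copy_deref(char *dst, const char *src,
                       long n, long dstride, long sstride)
{
    for (long i = 0; i < n; i++) {
        *(double *)dst = *(const double *)src;
        dst += dstride;
        src += sstride;
    }
}

static void copy_memcpy(char *dst, const char *src,
                        long n, long dstride, long sstride)
{
    for (long i = 0; i < n; i++) {
        memcpy(dst, src, sizeof(double));  /* size known at compile time */
        dst += dstride;
        src += sstride;
    }
}
```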
>On the other hand, I see that you have disabled the optimization for
>unaligned data through the use of a check. Is there any reason for
>doing that? If I remove this check, I can achieve performance similar
>to numarray's (a bit better, in fact).
>
>
The only reason was to avoid pointer dereferencing on misaligned data
(dereferencing a misaligned pointer causes bus errors on Solaris).
But, if we can achieve it with a memmove, then there is no reason to
limit the code.
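A minimal sketch of the approach being discussed, with hypothetical names: branch once per array on alignment, using the typed assignment on the fast path and falling back to `memcpy` (which makes no alignment assumptions, so it cannot cause bus errors on strict-alignment platforms such as Solaris/SPARC) for unaligned data.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative only -- not NumPy's actual code. */

static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

static void copy_doubles(char *dst, const char *src,
                         long n, long dstride, long sstride)
{
    if (is_aligned(dst, sizeof(double)) &&
        is_aligned(src, sizeof(double)) &&
        dstride % sizeof(double) == 0 &&
        sstride % sizeof(double) == 0) {
        /* Fast path: safe to dereference double pointers. */
        for (long i = 0; i < n; i++) {
            *(double *)dst = *(const double *)src;
            dst += dstride; src += sstride;
        }
    } else {
        /* Unaligned path: memcpy with constant size is still fast
         * and never traps on misaligned addresses. */
        for (long i = 0; i < n; i++) {
            memcpy(dst, src, sizeof(double));
            dst += dstride; src += sstride;
        }
    }
}
```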
>I'm attaching a small benchmark script that compares the performance
>of copying a 1D vector of 1 million elements in contiguous, strided
>(2 and 10), and strided (2 and 10 again) & unaligned flavors. The
>results for my machine (P4 @ 2 GHz) are:
>
>For the original numpy code (i.e. before Travis optimization):
>
>time for numpy contiguous --> 0.234
>time for numarray contiguous --> 0.229
>time for numpy strided (2) --> 1.605
>time for numarray strided (2) --> 0.263
>time for numpy strided (10) --> 1.72
>time for numarray strided (10) --> 0.264
>time for numpy strided (2) & unaligned--> 1.736
>time for numarray strided (2) & unaligned--> 0.402
>time for numpy strided (10) & unaligned--> 1.872
>time for numarray strided (10) & unaligned--> 0.435
>
>where you can see that, for 1e6 elements, the slowdown of the original
>numpy is almost 7x (!). Remember that in the previous benchmarks sent
>here the slowdown was 3x, but we were copying 10 times less data.
>
>For the pointer optimised code (i.e. the current SVN version):
>
>time for numpy contiguous --> 0.238
>time for numarray contiguous --> 0.232
>time for numpy strided (2) --> 0.214
>time for numarray strided (2) --> 0.264
>time for numpy strided (10) --> 0.299
>time for numarray strided (10) --> 0.262
>time for numpy strided (2) & unaligned--> 1.736
>time for numarray strided (2) & unaligned--> 0.401
>time for numpy strided (10) & unaligned--> 1.874
>time for numarray strided (10) & unaligned--> 0.433
>
>here you can see that your figures are very similar to numarray except
>for unaligned data (4x slower).
>
>For the pointer optimised code, but with the unaligned-data check removed:
>
>time for numpy contiguous --> 0.236
>time for numarray contiguous --> 0.231
>time for numpy strided (2) --> 0.213
>time for numarray strided (2) --> 0.262
>time for numpy strided (10) --> 0.297
>time for numarray strided (10) --> 0.261
>time for numpy strided (2) & unaligned--> 0.263
>time for numarray strided (2) & unaligned--> 0.403
>time for numpy strided (10) & unaligned--> 0.452
>time for numarray strided (10) & unaligned--> 0.432
>
>Hey! numpy is very similar to numarray in all cases, except for the
>strided (2) & unaligned case, where numpy performs 50%
>better.
>
>Finally, and just to show the effect of providing memcpy with size
>information at compile time, here is the numpy code using memcpy() with
>this optimization on (and with the alignment check disabled, of course!):
>
>time for numpy contiguous --> 0.234
>time for numarray contiguous --> 0.233
>time for numpy strided (2) --> 0.223
>time for numarray strided (2) --> 0.262
>time for numpy strided (10) --> 0.285
>time for numarray strided (10) --> 0.262
>time for numpy strided (2) & unaligned--> 0.261
>time for numarray strided (2) & unaligned--> 0.401
>time for numpy strided (10) & unaligned--> 0.42
>time for numarray strided (10) & unaligned--> 0.436
>
>you can see that the figures are very similar to the previous case. So
>Travis, you may want to use the pointer indirection approach or the
>memcpy() one, whichever you prefer.
>
>Well, I just wanted to point this out. Time for sleep!
>
>
>
Very, very useful information. A thousand thank-yous for taking the time
to investigate and assemble it. Do you think memmove would work
similarly?
-Travis
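The benchmark script itself was attached to the original message and is not reproduced in the archive. A minimal sketch of the same measurement, written against present-day NumPy and covering only the numpy side (the numarray timings are omitted), might look like this; the `bench` helper and the byte-offset trick for producing an unaligned view are this sketch's own devices, not necessarily those of the original script:

```python
import time
import numpy as np

def bench(label, dst, src, repeats=10):
    """Time repeated element-wise copies from src into dst."""
    t0 = time.time()
    for _ in range(repeats):
        dst[...] = src
    print("time for numpy %s --> %.3f" % (label, time.time() - t0))

n = 1_000_000
src = np.arange(n, dtype=np.float64)
dst = np.empty(n, dtype=np.float64)

bench("contiguous", dst, src)
for stride in (2, 10):
    bench("strided (%d)" % stride, dst[::stride], src[::stride])
    # Unaligned destination: view a byte buffer starting 1 byte in,
    # so every double sits at an odd address.
    buf = np.empty(n * 8 + 1, dtype=np.uint8)
    unaligned = buf[1:].view(np.float64)[::stride]
    bench("strided (%d) & unaligned" % stride,
          unaligned, src[::stride])
```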