[Numpy-discussion] Performance problems with strided arrays in NumPy
Travis Oliphant
oliphant at ee.byu.edu
Wed Apr 19 20:44:02 EDT 2006
faltet at xot.carabos.com wrote:
>On Wed, Apr 19, 2006 at 09:48:14PM +0000, faltet at xot.carabos.com wrote:
>
>
>>On Tue, Apr 18, 2006 at 09:01:54PM -0600, Travis Oliphant wrote:
>>
>>
>>>Apparently, it is *much* faster to do
>>>
>>>((double *)dst)[0] = ((double *)src)[0]
>>>
>>>when you have aligned data than it is to do
>>>
>>>memmove(dst, src, sizeof(double))
>>>
>>>
>>Mmm.. very interesting.
>>
>>
>
>A follow-up on this. After analyzing the issue a bit, it seems that
>the problem with the memcpy() version was not the call itself, but the
>parameter that was passed as the number of bytes to copy. As this
>parameter's value is unknown at compile time, the compiler cannot
>generate optimized code for it and always has to fetch its value from
>memory (or cache).
>
>
>In the version of the code that you optimized, you managed to do this
>because you are telling the compiler (i.e. specifying at compile
>time) the exact extent of the data copy, allowing it to generate
>optimal code for the copy operation. However, if you do a similar
>thing but using the call (using doubles here):
>
>memcpy(tout, tin, 8);
>
>instead of:
>
>((Float64 *)tout)[0] = ((Float64 *)tin)[0];
>
>and repeat the operation for the other types, then you can achieve
>performance similar to the pointer version.
>
>
This is good to know. It certainly makes sense. I'll test it on my
system when I get back.
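The two strided-copy variants being compared can be sketched as follows. This is an illustrative reconstruction, not NumPy's actual code; the function names and signatures are made up for the example. The point is that `sizeof(double)` is a compile-time constant, so the compiler can inline the `memcpy` into the same move instructions as the typed assignment.

```c
#include <string.h>

/* Copy n doubles between strided buffers (strides in bytes).
 * Variant A: typed pointer dereference -- fast, but requires
 * properly aligned data on strict-alignment platforms.
 * Variant B: memcpy with a compile-time-constant size, which the
 * compiler can inline; passing a runtime-variable size instead is
 * what caused the slowdown discussed above. */

static void copy_deref(char *dst, const char *src,
                       long n, long dstride, long sstride)
{
    for (long i = 0; i < n; i++) {
        *(double *)dst = *(const double *)src;
        dst += dstride;
        src += sstride;
    }
}

static void copy_memcpy(char *dst, const char *src,
                        long n, long dstride, long sstride)
{
    for (long i = 0; i < n; i++) {
        memcpy(dst, src, sizeof(double));  /* size known at compile time */
        dst += dstride;
        src += sstride;
    }
}
```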
>On the other hand, I see that you have disabled the optimization for
>unaligned data through the use of a check. Is there any reason for
>doing that? If I remove this check, I can achieve performance similar
>to numarray's (a bit better, in fact).
>
>
The only reason was to avoid pointer dereferencing on misaligned data
(dereferencing a misaligned pointer causes bus errors on Solaris).
But, if we can achieve it with a memmove, then there is no reason to
limit the code.
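A minimal sketch of the approach being discussed, with hypothetical names: branch once per array on alignment, using the typed assignment on the fast path and falling back to `memcpy` (which makes no alignment assumptions, so it cannot cause bus errors on strict-alignment platforms such as Solaris/SPARC) for unaligned data.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative only -- not NumPy's actual code. */

static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

static void copy_doubles(char *dst, const char *src,
                         long n, long dstride, long sstride)
{
    if (is_aligned(dst, sizeof(double)) &&
        is_aligned(src, sizeof(double)) &&
        dstride % sizeof(double) == 0 &&
        sstride % sizeof(double) == 0) {
        /* Fast path: safe to dereference double pointers. */
        for (long i = 0; i < n; i++) {
            *(double *)dst = *(const double *)src;
            dst += dstride; src += sstride;
        }
    } else {
        /* Unaligned path: memcpy with constant size is still fast
         * and never traps on misaligned addresses. */
        for (long i = 0; i < n; i++) {
            memcpy(dst, src, sizeof(double));
            dst += dstride; src += sstride;
        }
    }
}
```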
>I'm attaching a small benchmark script that compares the performance
>of copying a 1D vector of 1 million elements in contiguous, strided
>(2 and 10), and strided (2 and 10 again) & unaligned flavors. The
>results for my machine (P4 @ 2 GHz) are:
>
>For the original numpy code (i.e. before Travis optimization):
>
>time for numpy contiguous --> 0.234
>time for numarray contiguous --> 0.229
>time for numpy strided (2) --> 1.605
>time for numarray strided (2) --> 0.263
>time for numpy strided (10) --> 1.72
>time for numarray strided (10) --> 0.264
>time for numpy strided (2) & unaligned--> 1.736
>time for numarray strided (2) & unaligned--> 0.402
>time for numpy strided (10) & unaligned--> 1.872
>time for numarray strided (10) & unaligned--> 0.435
>
>where you can see that, for 1e6 elements, the slowdown of the original
>numpy is almost 7x (!). Remember that in the previous benchmarks sent
>here the slowdown was 3x, but we were copying 10 times less data.
>
>For the pointer optimised code (i.e. the current SVN version):
>
>time for numpy contiguous --> 0.238
>time for numarray contiguous --> 0.232
>time for numpy strided (2) --> 0.214
>time for numarray strided (2) --> 0.264
>time for numpy strided (10) --> 0.299
>time for numarray strided (10) --> 0.262
>time for numpy strided (2) & unaligned--> 1.736
>time for numarray strided (2) & unaligned--> 0.401
>time for numpy strided (10) & unaligned--> 1.874
>time for numarray strided (10) & unaligned--> 0.433
>
>here you can see that your figures are very similar to numarray except
>for unaligned data (4x slower).
>
>For the pointer optimised code, but with the unaligned-data check removed:
>
>time for numpy contiguous --> 0.236
>time for numarray contiguous --> 0.231
>time for numpy strided (2) --> 0.213
>time for numarray strided (2) --> 0.262
>time for numpy strided (10) --> 0.297
>time for numarray strided (10) --> 0.261
>time for numpy strided (2) & unaligned--> 0.263
>time for numarray strided (2) & unaligned--> 0.403
>time for numpy strided (10) & unaligned--> 0.452
>time for numarray strided (10) & unaligned--> 0.432
>
>Hey! numpy is very similar to numarray in all cases, except for the
>strided (2) & unaligned case, where numpy performs 50%
>better.
>
>Finally, and just to show the effect of providing memcpy with size
>information at compile time, here is the numpy code using memcpy() with
>this optimization on (and with the alignment check disabled, of course!):
>
>time for numpy contiguous --> 0.234
>time for numarray contiguous --> 0.233
>time for numpy strided (2) --> 0.223
>time for numarray strided (2) --> 0.262
>time for numpy strided (10) --> 0.285
>time for numarray strided (10) --> 0.262
>time for numpy strided (2) & unaligned--> 0.261
>time for numarray strided (2) & unaligned--> 0.401
>time for numpy strided (10) & unaligned--> 0.42
>time for numarray strided (10) & unaligned--> 0.436
>
>you can see that the figures are very similar to the previous case. So
>Travis, you may want to use the pointer indirection approach or the
>memcpy() one, whichever you prefer.
>
>Well, I just wanted to point this out. Time for sleep!
>
>
>
Very, very useful information. A thousand thank-yous for taking the time
to investigate and assemble it. Do you think memmove would work
similarly?
-Travis
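The benchmark script itself was attached to the original message and is not reproduced in the archive. A minimal sketch of the same measurement, written against present-day NumPy and covering only the numpy side (the numarray timings are omitted), might look like this; the `bench` helper and the byte-offset trick for producing an unaligned view are this sketch's own devices, not necessarily those of the original script:

```python
import time
import numpy as np

def bench(label, dst, src, repeats=10):
    """Time repeated element-wise copies from src into dst."""
    t0 = time.time()
    for _ in range(repeats):
        dst[...] = src
    print("time for numpy %s --> %.3f" % (label, time.time() - t0))

n = 1_000_000
src = np.arange(n, dtype=np.float64)
dst = np.empty(n, dtype=np.float64)

bench("contiguous", dst, src)
for stride in (2, 10):
    bench("strided (%d)" % stride, dst[::stride], src[::stride])
    # Unaligned destination: view a byte buffer starting 1 byte in,
    # so every double sits at an odd address.
    buf = np.empty(n * 8 + 1, dtype=np.uint8)
    unaligned = buf[1:].view(np.float64)[::stride]
    bench("strided (%d) & unaligned" % stride,
          unaligned, src[::stride])
```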