[Numpy-discussion] Vectorizing code, for loops, and all that

Tim Hochberg tim.hochberg at ieee.org
Tue Oct 3 12:12:37 EDT 2006


Travis Oliphant wrote:
> Albert Strasheim wrote:
>   
>>>> [1] 12.97% of function time
>>>> [2] 8.65% of function time
>>>> [3] 62.14% of function time
>>>>
>>>> If statistics from elsewhere in the code would be helpful, let me
>>>> know, and I'll see if I can convince Quantify to cough it up.
>>>>
>>> Please run the same test but using
>>>
>>> import numpy as N
>>>
>>> x1 = N.random.rand(39,2000)
>>> x2 = N.random.rand(39,64,1)
>>>
>>> z1 = x1[:,N.newaxis,:] - x2
>>>
>> Very similar results to what I had previously:
>>
>> [1] 10.88%
>> [2] 7.25%
>> [3] 68.25%
>>
> Thanks,
>
> I've got some ideas about how to speed this up by eliminating some of
> the unnecessary calculations going on outside of the function loop, but
> there will still be some speed issues, depending on how the array is
> traversed, once you get above a certain size. I'm not sure there's any
> way around that, ultimately, since memory access is slow on most hardware.
>
> If anyone has any ideas, I'd love to hear them. I won't be able to
> get to implementing my ideas until at least Friday (also when rc2 will
> be released).
>
I had an idea regarding which axis to operate on first. Rather than 
operating on strictly the longest axis or strictly the innermost axis, a 
hybrid approach could be used: operate on the longest axis that would 
not result in the inner loop overflowing the cache. The idea is to 
minimize the loop overhead, as we do now by choosing the largest axis, 
while at the same time maintaining cache friendliness.
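To make that concrete, here's a rough Python sketch of the selection
rule. This is just a toy model of the idea, not the actual ufunc
machinery; the function name, the innermost-first shape convention, and
the element-based cache budget are all made up for illustration:

    def choose_loop_axis(shape, cache_elements=2000):
        # `shape` lists axis lengths from innermost to outermost for a
        # contiguous array. Each axis's element stride is the product
        # of the lengths of all axes inside it.
        strides, step = [], 1
        for length in shape:
            strides.append(step)
            step *= length
        # Memory span (in elements) touched by one pass of an inner
        # loop over axis i.
        spans = [n * s for n, s in zip(shape, strides)]
        fits = [i for i, span in enumerate(spans) if span <= cache_elements]
        if not fits:
            # Even the innermost axis overflows the cache; loop over it
            # anyway, since a stride-1 sweep hurts the least.
            return 0
        # Otherwise pick the longest axis that stays cache friendly.
        return max(fits, key=lambda i: shape[i])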

Let's suppose that it's safe to stuff 2000 array elements into the 
cache[1], and let's consider two arrays whose axis lengths[2], from 
innermost to outermost, are [10,100,1000] and [100,10,1000]. Under this 
scheme, we would loop over the length-100 axis in both cases. Looping 
over the length-1000 axis instead would pull all 1e6 elements into the 
cache, since successive elements along that axis sit 1000 elements apart 
in memory. If the innermost axis alone exceeded our cache-imposed limit, 
we would of course just loop over that axis first.
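Checking the sketch against those two shapes (axis 0 is innermost here):

    >>> choose_loop_axis([10, 100, 1000])   # spans: 10, 1000, 1e6
    1
    >>> choose_loop_axis([100, 10, 1000])   # spans: 100, 1000, 1e6
    0

i.e., the length-100 axis is chosen in both cases, as described above.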

-tim



[1] This limit would really be specified in bytes, and thus the number 
of allowable elements would vary by data type. It might also need to be 
tuned based on architecture.
[2] I believe that there's some consolidation of adjacent axes that goes 
on where possible; let's assume that's already happened at this point.
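To make footnote [1] concrete, the element budget used in the sketch
above would follow from a byte budget and the dtype's itemsize; the
64 KiB figure here is purely a placeholder:

    cache_bytes = 64 * 1024          # placeholder; would be tuned per CPU
    budget_f8 = cache_bytes // 8     # 8192 float64 elements
    budget_f4 = cache_bytes // 4     # 16384 float32 elements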

