[Numpy-discussion] Rookie problems - Why is C-code much faster?

Tue Feb 21 11:12:04 EST 2006

Mads Ipsen wrote:

>On Tue, 21 Feb 2006, Zachary Pincus wrote:
>
>  
>
>>Mads,
>>
>>The game with numpy, just as it is with Matlab or any other
>>interpreted numeric environment, is to try push as much of the
>>looping down into the C code as you can. This is because, as you now
>>know, compiled C can loop much faster than interpreted python.
>>
>>A simple example for averaging 1000 (x,y,z) points:
>>
>>print data.shape
>>(1000, 3)
>># bad: explicit for loop in python
>>avg = numpy.zeros(3, numpy.float_)
>>for i in data: avg += i
>>avg /= 1000.0
>>
>># good: implicit for loop in C
>>avg = numpy.add.reduce(data, axis = 0)
>>avg /= 1000.0
>>
>>In your case, instead of explicitly looping through each point, why
>>not do the calculations in parallel, operating on entire vectors of
>>points at one time? Then the looping is "pushed down" into compiled C
>>code. Or if you're really lucky, it's pushed all the way down to the
>>vector math units on your cpu if you have a good BLAS or whatever
>>installed.
>>
>>Zach Pincus
>>
>>Program in Biomedical Informatics and Department of Biochemistry
>>Stanford University School of Medicine
>>
>>
>>
>>    
>>
>>>2. Here is the main loop for finding all possible pair distances,
>>>   which corresponds to a loop over the upper triangular part of a
>>>   square matrix
>>>
>>>     # Loop over all particles
>>>     for i in range(n-1):
>>>        dx = x[i+1:] - x[i]
>>>        dy = y[i+1:] - y[i]
>>>
>>>        dx -= box*rint(dx/box)
>>>        dy -= box*rint(dy/box)
>>>
>>>        r2 = dx**2 + dy**2      # square of dist. between points
>>>
>>>where x and y contain the positions of the particles. A naive
>>>implementation in C is
>>>
>>>
>>>    // loop over all particles
>>>    for (int i=0; i<n-1; i++) {
>>>        for (int j=i+1; j<n; j++) {
>>>            dx = x[j] - x[i];
>>>            dy = y[j] - y[i];
>>>
>>>            dx -= box*rint(dx/box);
>>>            dy -= box*rint(dy/box);
>>>
>>>            r2 = dx*dx + dy*dy;
>>>        }
>>>    }
>>>
>>>For n = 2500 particles, i.e. 3123750 particle pairs, the C loop is
>>>app. 10 times faster than the Python/Numeric counterpart. This is of
>>>course not satisfactory.
>>>
>>>Are there any things I am doing completely wrong here, basic
>>>approaches completely misunderstood, misuses etc?
>>>
>>>Any suggestions, guidelines, hints are most welcome.
>>>
>>>Best regards,
>>>
>>>Mads Ipsen
>>>
>>>
>>>+---------------------------------+-------------------------+
>>>| Mads Ipsen                      |                         |
>>>| Dept. of Chemistry              | phone:     +45-35320220 |
>>>| H.C.Ørsted Institute            | fax:       +45-35320322 |
>>>| Universitetsparken 5            |                         |
>>>| DK-2100 Copenhagen Ø, Denmark   | mpi at osc.kiku.dk         |
>>>+---------------------------------+-------------------------+
>>>
>>>
>>>-------------------------------------------------------
>>>This SF.net email is sponsored by: Splunk Inc. Do you grep through
>>>log files
>>>for problems?  Stop!  Download the new AJAX search engine that makes
>>>searching your log files as easy as surfing the  web.  DOWNLOAD
>>>SPLUNK!
>>>http://sel.as-us.falkag.net/sel?cmd=lnk&kid3432&bid#0486&dat1642
>>>_______________________________________________
>>>Numpy-discussion mailing list
>>>Nump
>>>      
>>>
>
>I agree completely with your comments. But, as you can, the innermost part
>of the loop has been removed in the code, and replaced with numpy slices.
>It's hard for me to see how to compress the outer loop as well, since it
>determines the ranges for the inner loop. Unless there is some fancy slice
>notation, that allows you to loop over a triangular part of a matrix, ie.
>
>    x[i] = sum(A[i+1,i])
>
>meaning x[i] = sum of elements in i'th row of A using only elements from
>position i+1 up to n.
>
>Of course, there is the possibility of hardcoding this in C and then make
>it available as a Python module. But I don't want to do this before I am
>sure there isn't a numpy way out this.
>
>Let me know, if you have any suggestions.
>

Can you explain a little more about what you are trying to calculate? 
The bit about subtracting off  box*rint(dx/box) is a little odd. It 
almost seems like you should be able to do something with fmod, but I 
admit that I'm not sure how.

If I had to guess as to source of the relative slowness I'd say it's 
because you are creating a lot of temporary matrices. There are ways to 
avoid this, but when taken to the extreme, they make your code look 
ugly. You might try the following, untested, code or some variation and 
see if it speeds things up. This makes extensive use of the little known 
optional destination argument for ufuncs. I only tend to do this sort of 
stuff where it's very critical since, as you can see, it makes things 
quite ugly.

     dx_space = x.copy()
     dy_space = y.copy()
     scratch_space = x.copy()
     for i in range(n-1):
        dx = dx_space[i+1:]
        dy = dy_space[i+1:]
        scratch = scratch_space[i+1:]
        subtract(x[i+1:], x[i], dx)
        subtract(y[i+1:], y[i], dy) 
        #   dx -= box*rint(dx/box)
        divide(dx, box, scratch)
        rint(scratch, scratch)
        scratch *= box
        dx -= scratch
        # dy -= box*rint(dy/box)
        divide(dy, box, scratch)
        rint(scratch, scratch)
        scratch *= box
        dy -= scratch
        r2 = dx**2 + dy**2      # square of dist. between points

Hope that helps:

-tim