Performance Python: An introduction.

A beginners guide to using Python for performance computing.

This is a simple introductory document to using Python for performance computing. This should hopefully get you started. We'll use Numeric, scipy.weave (blitz and inline), and also give an example of how to use f2py to use a Fortran subroutine from within Python.

We will also use this opportunity to benchmark the various ways to solve this problem and will compare this to an implementation of the algorithm in simple c++.

Problem description

The example we will consider is a very simple (read, trivial) example of solving the 2D Laplace equation using an iterative finite difference scheme (four point averaging, Gauss-Seidel or Gauss-Jordan). The formal specification of the problem is as follows. We are required to solve for some unknown function u(x,y) such that \nabla^2 u = 0 with a boundary condition specified. For convenience the domain of interest is considered to be a rectangle and the boundary values at the sides of this rectangle are given.

It can be shown that this problem can be solved using a simple four point averaging scheme as follows. Discretise the domain into an (nx x ny) grid of points. Then the function u can be represented as a 2 dimensional array - u(nx, ny). The values of u along the sides of the rectangle are given. The solution can be obtained by iterating in the following manner.

         for i in range(1, nx-1):
            for j in range(1, ny-1):
                u[i,j] = ((u[i-1, j] + u[i+1, j])*dy^2 +
                          (u[i, j-1] + u[i, j+1])*dx^2)/(2.0*(dx^2 + dy^2))

Where dx and dy are the lengths along the x and y axis of the discretised domain.

Numerical Solution

Implementing a solver for this is straight forward in Pure Python using simple a simple Numeric array to store the solution matrix u. The following code demonstrates a simple solver.

#!/usr/bin/env python
from Numeric import *
from math import *

class Grid:    
    """A simple grid class that stores the details and solution of the
    computational grid."""
    def __init__(self, nx=10, ny=10, xmin=0.0, xmax=1.0,
                 ymin=0.0, ymax=1.0):
        self.xmin, self.xmax, self.ymin, self.ymax = xmin, xmax, ymin, ymax
        self.dx = float(xmax-xmin)/(nx-1)
        self.dy = float(ymax-ymin)/(ny-1)
        self.u = zeros((nx, ny), 'd')
        self.old_u = self.u.copy() # used to compute the change in solution

    def setBCFunc(self, func):
        """Sets the Boundary condition given a function of two variables."""
        xmin, ymin = self.xmin, self.ymin
        xmax, ymax = self.xmax, self.ymax
        x = arange(xmin, xmax + self.dx*0.5, self.dx)
        y = arange(ymin, ymax + self.dy*0.5, self.dy)
        self.u[0 ,:] = func(xmin,y)
        self.u[-1,:] = func(xmax,y)
        self.u[:, 0] = func(x,ymin)
        self.u[:,-1] = func(x,ymax)

    def computeError(self):        
        """Computes absolute error using an L2 norm for the solution.
        This requires that self.u and self.old_u must be appropriately
        setup."""        
        v = (self.u - self.old_u).flat
        return sqrt(dot(v,v))
    

class LaplaceSolver:    
    """A simple Laplacian solver that can use different schemes to
    solve the problem."""    
    def __init__(self, grid, stepper='numeric'):
        self.grid = grid

    def timeStep(self, dt=0.0):
        """Takes a time step using straight forward Python loops."""
        g = self.grid
        nx, ny = g.u.shape        
        dx2, dy2 = g.dx**2, g.dy**2
        dnr_inv = 0.5/(dx2 + dy2)
        u = g.u

        err = 0.0
        for i in range(1, nx-1):
            for j in range(1, ny-1):
                tmp = u[i,j]
                u[i,j] = ((u[i-1, j] + u[i+1, j])*dy2 +
                          (u[i, j-1] + u[i, j+1])*dx2)*dnr_inv
                diff = u[i,j] - tmp
                err += diff*diff

        return sqrt(err)
        
    def solve(self, n_iter=0, eps=1.0e-16):        
        err = self.timeStep()
        count = 1

        while err > eps:
            if n_iter and count >= n_iter:
                return err
            err = self.timeStep()
            count = count + 1

        return count

The code is pretty simple and very easy to write but if you run it for any sizeable problem (say a 500 x 500 grid of points). You'll find that it takes *forever* to run. The CPU hog in this case will be the timeStep method. In the next section we will speed it up using NumPy.

Using NumPy

It turns out that the innermost loop of the LaplaceSolver.timeStep method can be readily expressed by a much simpler NumPy expression. Here is a re-written timeStep method.

    def timeStep(self, dt=0.0):
        """Takes a time step using a numeric expressions."""
        g = self.grid
        dx2, dy2 = g.dx**2, g.dy**2
        dnr_inv = 0.5/(dx2 + dy2)
        u = g.u
        g.old_u = u.copy() # needed to compute the error.

        # The actual iteration
        u[1:-1, 1:-1] = ((u[0:-2, 1:-1] + u[2:, 1:-1])*dy2 + 
                         (u[1:-1,0:-2] + u[1:-1, 2:])*dx2)*dnr_inv
        
        return g.computeError()

The entire for i and j loops have been replaced by a single Numeric expression. Numeric expressions operate element wise and hence the above expression works. It basically computes the four point average. If you have gone through the numeric tutorial and played with Numeric a bit you should be able to understand how this works. The beauty of the expression is that its completely done in C. This makes the computation *much* faster. For a quick comparison here are some numbers for a single iteration on a 500x500 grid. On a PIII 450Mhz with 192 MB RAM, the above takes about 0.35 seconds whereas the slower one takes a whopping 18.7 seconds. This is already a 53 fold speed increase. You will also note a few things.

We cannot compute the error the way we did earlier inside the for loop. We need to make a copy of the data and then use the computeError function to do this. This costs us memory and is not very pretty. This is certainly a limitation but is worth a 53 fold speed increase.
The expression will use temporaries hence during one iteration, the computed values at a previous index will not be used for the next one. For instance in the original for loop once the value of u[1,1] say is computed, the next value for u[1,2] will use the newly computed u[1,1] and not the old one. However in the numeric expression since it uses temporaries internally only the old value of u[1,1] will be used. This is not a serious issue in this case because it is known that even when this happens the algorithm will converge.

Apart from these two issues its clear that using Numeric boosts speed tremendously. We will now use the amazing scipy.weave to speed this up further.

Using scipy.weave.blitz

The Numeric expression can be speeded up quite a bit if we use weave.blitz. Here is the new function.

# import necessary modules and functions
import scipy.weave

    def timeStep(self, dt=0.0):
        """Takes a time step using a numeric expression that has been
        blitzed using weave."""        
        g = self.grid
        dx2, dy2 = g.dx**2, g.dy**2
        dnr_inv = 0.5/(dx2 + dy2)
        u = g.u
        g.old_u = u.copy()

        # The actual iteration
        expr = "u[1:-1, 1:-1] = ((u[0:-2, 1:-1] + u[2:, 1:-1])*dy2 + "\
               "(u[1:-1,0:-2] + u[1:-1, 2:])*dx2)*dnr_inv"
        scipy.weave.blitz(expr, check_size=0)

        return g.computeError()

If you notice the only thing that has changed is that we put quotes around the original numeric expression and call this string 'expr' and then invoke scipy.weave.blitz. The check_size keyword when set to 1 does a few sanity checks and is to be used where you are debugging your code. However for pure speed its is wise to set it to 0. This time when we time the code for a 500x500 array for a single iteration it takes only about 0.12 seconds which is about a three fold increase! There are again a few things to note.

The first time round the function call will take a long while to do some magic behind your back. Next time round the code will run immediately. More details on this are is in the weave documentation.
Again we need to use a temporary array to compute the error.
blitz does *not* use temporaries for the computation and therefore behaves more like the original (slow) for loop in that the computed values are re-used immediately.

Apart from these two the results are identical as compared to the original for loop. Its only about 170 times faster than the original code! We will now look at yet another way to speed up our original for loop. Enter scipy.weave.inline!

Using scipy.weave.inline

Inline allows one to embed C code directly into your Python code. So here is a simple version of an inlined version of the code.

    def inlineTimeStep(self, dt=0.0):        
        """Takes a time step using inlined C code -- this version uses
        blitz arrays."""        
        g = self.grid
        nx, ny = g.u.shape
        dx2, dy2 = g.dx**2, g.dy**2
        dnr_inv = 0.5/(dx2 + dy2)
        u = g.u
        
        code = """
               #line 120 "laplace.py" (This is only useful for debugging)
               double tmp, err, diff;
               err = 0.0;
               for (int i=1; i<nx-1; ++i) {
                   for (int j=1; j<ny-1; ++j) {
                       tmp = u(i,j);
                       u(i,j) = ((u(i-1,j) + u(i+1,j))*dy2 +
                                 (u(i,j-1) + u(i,j+1))*dx2)*dnr_inv;
                       diff = u(i,j) - tmp;
                       err += diff*diff;
                   }
               }
               return_val = Py::new_reference_to(Py::Float(sqrt(err)));
               """
        # compiler keyword only needed on windows with MSVC installed
        err = scipy.weave.inline(code,
                                 ['u', 'dx2', 'dy2', 'dnr_inv', 'nx','ny'],
                                 type_factories = blitz_type_factories,
                                 compiler = 'gcc',
                                 extra_compile_args =
                                 ['-O3','-malign-double','-funroll-loops'])
        return err

The code itself looks very straightforward (which is what makes inline so cool). The inline function call arguments are all self explanatory. The line with '#line 120 ...' is only used for debugging and doesn't affect the speed in anyway. Again the first time you run this function it takes a long while to do something and then next time round it blazes away. This time notice that we have far more flexibility in our loop and can easily compute an error term without an need for temporary arrays. Timing this version results in a time for a 500x500 array of a mere 0.04 seconds per iteration! This corresponds to a whopping 430 fold speed increase over the plain old for loop. And remember we haven't sacrificed any of Pythons incredible flexibility! This loop contains code that looks very nice but if we want to we can speed things up further by writing a little dirty code. We won't get into that here but it suffices to say that its possible to get a further speed up of about 50% by using a different version.

Finally we look at how its possible to easily implement the loop inside Fortran and call it from Python by using f2py.

Using f2py

f2py is an amazing utility that lets you easily call Fortran functions from Python. First we will write a small Fortran subroutine to do our calculation. Here is the code.

c File flaplace.f
      subroutine timestep(u,n,m,dx,dy,error)
      double precision u(n,m)
      double precision dx,dy,dx2,dy2,dnr_inv,tmp,diff
      integer n,m,i,j
cf2py intent(in) :: dx,dy
cf2py intent(in,out) :: u
cf2py intent(out) :: error
cf2py intent(hide) :: n,m
      dx2 = dx*dx
      dy2 = dy*dy
      dnr_inv = 0.5d0 / (dx2+dy2)
      error = 0
      do 200,i=2,n-1
         do 100,j=2,m-1
            tmp = u(i,j)
            u(i,j) = ((u(i-1,j) + u(i+1,j))*dy2+
     &           (u(i,j-1) + u(i,j+1))*dx2)*dnr_inv 
            diff = u(i,j) - tmp
            error = error + diff*diff
 100     continue
 200  continue
      error = sqrt(error)
      end

The lines starting with cf2py are special f2py directives and are documented in f2py. The rest of the code is straightforward for those who know some Fortran. We trivially create a Python module for this using the following command.

      % f2py -c flaplace.f -m flaplace

You need the latest version of f2py to do this. Here is how the Python side of things looks.

import flaplace

    def fortranTimeStep(self, dt=0.0):
        """Takes a time step using a simple fortran module that
        implements the loop in Fortran.  """        
        g = self.grid
        u, err = flaplace.timestep(g.u, g.dx, g.dy)
        return err

Thats it! Hopefully someday scipy.weave will let us do this inline and not require us to write a separate Fortran file. Anyway, using this module it takes about 0.03 seconds for a 500x500 grid per iteration. This is about a 550 fold speed increase over the original code and about 40% faster than the slower version of the inlined C. The faster inlined C code is about 10% faster than the Fortran version though.

Plain old C++

Finally we implemented this in simple C++ (nothing fancy) without any Python just for comparison. One should expect that C will be faster but the surprising thing is that its not by much! Given the fact that its so easy to develop with Python this speed reduction really is not much of an issue. Here are some timing results for a 500x500 grid for 100 iterations.

Type of solution	Time taken (sec)
Slow python (estimate)	1897.0
Numeric	29.0
Blitz	10.2
Inline	4.3
Fast Inline	2.9
Python/Fortran	3.2
Pure C++	2.4

This is pretty amazing considering the flexibility and power of Python.

The file laplace.py incorporates all of these methods so you can test these out for yourselves. Also included are the Fortran code and the pure C++ code that you can compare with.

Prabhu Ramachandran

Last modified: Tue Jan 15 00:43:11 IST 2002