This is a simple introductory document to using Python for performance computing. This should hopefully get you started. We'll use Numeric, scipy.weave (blitz and inline), and also give an example of how to use f2py to use a Fortran subroutine from within Python.
We will also use this opportunity to benchmark the various ways to solve this problem and will compare this to an implementation of the algorithm in simple c++.
The example we will consider is a very simple (read, trivial) example of solving the 2D Laplace equation using an iterative finite difference scheme (four point averaging, Gauss-Seidel or Gauss-Jordan). The formal specification of the problem is as follows. We are required to solve for some unknown function u(x,y) such that \nabla^2 u = 0 with a boundary condition specified. For convenience the domain of interest is considered to be a rectangle and the boundary values at the sides of this rectangle are given.
It can be shown that this problem can be solved using a simple four point averaging scheme as follows. Discretise the domain into an (nx x ny) grid of points. Then the function u can be represented as a 2 dimensional array - u(nx, ny). The values of u along the sides of the rectangle are given. The solution can be obtained by iterating in the following manner.
for i in range(1, nx-1): for j in range(1, ny-1): u[i,j] = ((u[i-1, j] + u[i+1, j])*dy^2 + (u[i, j-1] + u[i, j+1])*dx^2)/(2.0*(dx^2 + dy^2))
Where dx and dy are the lengths along the x and y axis of the discretised domain.
Implementing a solver for this is straight forward in Pure Python using simple a simple Numeric array to store the solution matrix u. The following code demonstrates a simple solver.
#!/usr/bin/env python from Numeric import * from math import * class Grid: """A simple grid class that stores the details and solution of the computational grid.""" def __init__(self, nx=10, ny=10, xmin=0.0, xmax=1.0, ymin=0.0, ymax=1.0): self.xmin, self.xmax, self.ymin, self.ymax = xmin, xmax, ymin, ymax self.dx = float(xmax-xmin)/(nx-1) self.dy = float(ymax-ymin)/(ny-1) self.u = zeros((nx, ny), 'd') self.old_u = self.u.copy() # used to compute the change in solution def setBCFunc(self, func): """Sets the Boundary condition given a function of two variables.""" xmin, ymin = self.xmin, self.ymin xmax, ymax = self.xmax, self.ymax x = arange(xmin, xmax + self.dx*0.5, self.dx) y = arange(ymin, ymax + self.dy*0.5, self.dy) self.u[0 ,:] = func(xmin,y) self.u[-1,:] = func(xmax,y) self.u[:, 0] = func(x,ymin) self.u[:,-1] = func(x,ymax) def computeError(self): """Computes absolute error using an L2 norm for the solution. This requires that self.u and self.old_u must be appropriately setup.""" v = (self.u - self.old_u).flat return sqrt(dot(v,v)) class LaplaceSolver: """A simple Laplacian solver that can use different schemes to solve the problem.""" def __init__(self, grid, stepper='numeric'): self.grid = grid def timeStep(self, dt=0.0): """Takes a time step using straight forward Python loops.""" g = self.grid nx, ny = g.u.shape dx2, dy2 = g.dx**2, g.dy**2 dnr_inv = 0.5/(dx2 + dy2) u = g.u err = 0.0 for i in range(1, nx-1): for j in range(1, ny-1): tmp = u[i,j] u[i,j] = ((u[i-1, j] + u[i+1, j])*dy2 + (u[i, j-1] + u[i, j+1])*dx2)*dnr_inv diff = u[i,j] - tmp err += diff*diff return sqrt(err) def solve(self, n_iter=0, eps=1.0e-16): err = self.timeStep() count = 1 while err > eps: if n_iter and count >= n_iter: return err err = self.timeStep() count = count + 1 return count
The code is pretty simple and very easy to write but if you run it for any sizeable problem (say a 500 x 500 grid of points). You'll find that it takes *forever* to run. The CPU hog in this case will be the timeStep method. In the next section we will speed it up using NumPy.
It turns out that the innermost loop of the LaplaceSolver.timeStep method can be readily expressed by a much simpler NumPy expression. Here is a re-written timeStep method.
def timeStep(self, dt=0.0): """Takes a time step using a numeric expressions.""" g = self.grid dx2, dy2 = g.dx**2, g.dy**2 dnr_inv = 0.5/(dx2 + dy2) u = g.u g.old_u = u.copy() # needed to compute the error. # The actual iteration u[1:-1, 1:-1] = ((u[0:-2, 1:-1] + u[2:, 1:-1])*dy2 + (u[1:-1,0:-2] + u[1:-1, 2:])*dx2)*dnr_inv return g.computeError()
The entire for i and j loops have been replaced by a single Numeric expression. Numeric expressions operate element wise and hence the above expression works. It basically computes the four point average. If you have gone through the numeric tutorial and played with Numeric a bit you should be able to understand how this works. The beauty of the expression is that its completely done in C. This makes the computation *much* faster. For a quick comparison here are some numbers for a single iteration on a 500x500 grid. On a PIII 450Mhz with 192 MB RAM, the above takes about 0.35 seconds whereas the slower one takes a whopping 18.7 seconds. This is already a 53 fold speed increase. You will also note a few things.
Apart from these two issues its clear that using Numeric boosts speed tremendously. We will now use the amazing scipy.weave to speed this up further.
The Numeric expression can be speeded up quite a bit if we use weave.blitz. Here is the new function.
# import necessary modules and functions import scipy.weave def timeStep(self, dt=0.0): """Takes a time step using a numeric expression that has been blitzed using weave.""" g = self.grid dx2, dy2 = g.dx**2, g.dy**2 dnr_inv = 0.5/(dx2 + dy2) u = g.u g.old_u = u.copy() # The actual iteration expr = "u[1:-1, 1:-1] = ((u[0:-2, 1:-1] + u[2:, 1:-1])*dy2 + "\ "(u[1:-1,0:-2] + u[1:-1, 2:])*dx2)*dnr_inv" scipy.weave.blitz(expr, check_size=0) return g.computeError()
If you notice the only thing that has changed is that we put quotes around the original numeric expression and call this string 'expr' and then invoke scipy.weave.blitz. The check_size keyword when set to 1 does a few sanity checks and is to be used where you are debugging your code. However for pure speed its is wise to set it to 0. This time when we time the code for a 500x500 array for a single iteration it takes only about 0.12 seconds which is about a three fold increase! There are again a few things to note.
Apart from these two the results are identical as compared to the original for loop. Its only about 170 times faster than the original code! We will now look at yet another way to speed up our original for loop. Enter scipy.weave.inline!
Inline allows one to embed C code directly into your Python code. So here is a simple version of an inlined version of the code.
def inlineTimeStep(self, dt=0.0): """Takes a time step using inlined C code -- this version uses blitz arrays.""" g = self.grid nx, ny = g.u.shape dx2, dy2 = g.dx**2, g.dy**2 dnr_inv = 0.5/(dx2 + dy2) u = g.u code = """ #line 120 "laplace.py" (This is only useful for debugging) double tmp, err, diff; err = 0.0; for (int i=1; i<nx-1; ++i) { for (int j=1; j<ny-1; ++j) { tmp = u(i,j); u(i,j) = ((u(i-1,j) + u(i+1,j))*dy2 + (u(i,j-1) + u(i,j+1))*dx2)*dnr_inv; diff = u(i,j) - tmp; err += diff*diff; } } return_val = Py::new_reference_to(Py::Float(sqrt(err))); """ # compiler keyword only needed on windows with MSVC installed err = scipy.weave.inline(code, ['u', 'dx2', 'dy2', 'dnr_inv', 'nx','ny'], type_factories = blitz_type_factories, compiler = 'gcc', extra_compile_args = ['-O3','-malign-double','-funroll-loops']) return err
The code itself looks very straightforward (which is what makes inline so cool). The inline function call arguments are all self explanatory. The line with '#line 120 ...' is only used for debugging and doesn't affect the speed in anyway. Again the first time you run this function it takes a long while to do something and then next time round it blazes away. This time notice that we have far more flexibility in our loop and can easily compute an error term without an need for temporary arrays. Timing this version results in a time for a 500x500 array of a mere 0.04 seconds per iteration! This corresponds to a whopping 430 fold speed increase over the plain old for loop. And remember we haven't sacrificed any of Pythons incredible flexibility! This loop contains code that looks very nice but if we want to we can speed things up further by writing a little dirty code. We won't get into that here but it suffices to say that its possible to get a further speed up of about 50% by using a different version.
Finally we look at how its possible to easily implement the loop inside Fortran and call it from Python by using f2py.
f2py is an amazing utility that lets you easily call Fortran functions from Python. First we will write a small Fortran subroutine to do our calculation. Here is the code.
c File flaplace.f subroutine timestep(u,n,m,dx,dy,error) double precision u(n,m) double precision dx,dy,dx2,dy2,dnr_inv,tmp,diff integer n,m,i,j cf2py intent(in) :: dx,dy cf2py intent(in,out) :: u cf2py intent(out) :: error cf2py intent(hide) :: n,m dx2 = dx*dx dy2 = dy*dy dnr_inv = 0.5d0 / (dx2+dy2) error = 0 do 200,i=2,n-1 do 100,j=2,m-1 tmp = u(i,j) u(i,j) = ((u(i-1,j) + u(i+1,j))*dy2+ & (u(i,j-1) + u(i,j+1))*dx2)*dnr_inv diff = u(i,j) - tmp error = error + diff*diff 100 continue 200 continue error = sqrt(error) end
The lines starting with cf2py are special f2py directives and are documented in f2py. The rest of the code is straightforward for those who know some Fortran. We trivially create a Python module for this using the following command.
% f2py -c flaplace.f -m flaplace
You need the latest version of f2py to do this. Here is how the Python side of things looks.
import flaplace def fortranTimeStep(self, dt=0.0): """Takes a time step using a simple fortran module that implements the loop in Fortran. """ g = self.grid u, err = flaplace.timestep(g.u, g.dx, g.dy) return err
Thats it! Hopefully someday scipy.weave will let us do this inline and not require us to write a separate Fortran file. Anyway, using this module it takes about 0.03 seconds for a 500x500 grid per iteration. This is about a 550 fold speed increase over the original code and about 40% faster than the slower version of the inlined C. The faster inlined C code is about 10% faster than the Fortran version though.
Finally we implemented this in simple C++ (nothing fancy) without any Python just for comparison. One should expect that C will be faster but the surprising thing is that its not by much! Given the fact that its so easy to develop with Python this speed reduction really is not much of an issue. Here are some timing results for a 500x500 grid for 100 iterations.
Type of solution |
Time taken (sec) |
Slow python (estimate) |
1897.0 |
Numeric |
29.0 |
Blitz |
10.2 |
Inline |
4.3 |
Fast Inline |
2.9 |
Python/Fortran |
3.2 |
Pure C++ |
2.4 |
This is pretty amazing considering the flexibility and power of Python.
The file laplace.py incorporates all of these methods so you can test these out for yourselves. Also included are the Fortran code and the pure C++ code that you can compare with.