A beginners guide to using Python for performance computing.


  This is a simple introductory document to using Python for
  performance computing.  This should hopefully get you started.
  We'll use Numeric, scipy.weave (blitz and inline), and also give an
  example of how to use f2py to use a Fortran subroutine from within
  Python.

  We will also use this opportunity to benchmark the various ways to
  solve this problem and will compare this to an implementation of the
  algorithm in simple c++.

  Problem description

    The example we will consider is a very simple (read, trivial)
    example of solving the 2D Laplace equation using an iterative
    finite difference scheme (four point averaging, Gauss-Seidel or
    Gauss-Jordan).  The formal specification of the problem is as
    follows.  We are required to solve for some unknown function u(x,y)
    such that
 
        \nabla^2 u = 0

    with a boundary condition specified.  For convenience the domain
    of interest is considered to be a rectangle and the boundary
    values at the sides of this rectangle are given.

    It can be shown that this problem can be solved using a simple
    four point averaging scheme as follows.  Discretise the domain
    into an (nx x ny) grid of points.  Then the function u can be
    represented as a 2 dimensional array - u(nx, ny).  The values of u
    along the sides of the rectangle are given.  The solution can be
    obtained by iterating in the following manner.

         for i in range(1, nx-1):
            for j in range(1, ny-1):
                u[i,j] = ((u[i-1, j] + u[i+1, j])*dy^2 +
                          (u[i, j-1] + u[i, j+1])*dx^2)/(2.0*(dx^2 + dy^2))


    Where dx and dy are the lengths along the x and y axis of the
    discretised domain.
   

  Numerical Solution

    Implementing a solver for this is straight forward in Pure Python
    using simple a simple Numeric array to store the solution matrix
    u.  The following code demonstrates a simple solver::


#!/usr/bin/env python
from Numeric import *
from math import *

class Grid:    
    """A simple grid class that stores the details and solution of the
    computational grid."""
    def __init__(self, nx=10, ny=10, xmin=0.0, xmax=1.0,
                 ymin=0.0, ymax=1.0):
        self.xmin, self.xmax, self.ymin, self.ymax = xmin, xmax, ymin, ymax
        self.dx = float(xmax-xmin)/(nx-1)
        self.dy = float(ymax-ymin)/(ny-1)
        self.u = zeros((nx, ny), 'd')
        self.old_u = self.u.copy() # used to compute the change in solution

    def setBCFunc(self, func):
        """Sets the Boundary condition given a function of two variables."""
        xmin, ymin = self.xmin, self.ymin
        xmax, ymax = self.xmax, self.ymax
        x = arange(xmin, xmax + self.dx*0.5, self.dx)
        y = arange(ymin, ymax + self.dy*0.5, self.dy)
        self.u[0 ,:] = func(xmin,y)
        self.u[-1,:] = func(xmax,y)
        self.u[:, 0] = func(x,ymin)
        self.u[:,-1] = func(x,ymax)

    def computeError(self):        
        """Computes absolute error using an L2 norm for the solution.
        This requires that self.u and self.old_u must be appropriately
        setup."""        
        v = (self.u - self.old_u).flat
        return sqrt(dot(v,v))
    

class LaplaceSolver:    
    """A simple Laplacian solver that can use different schemes to
    solve the problem."""    
    def __init__(self, grid, stepper='numeric'):
        self.grid = grid

    def timeStep(self, dt=0.0):
        """Takes a time step using straight forward Python loops."""
        g = self.grid
        nx, ny = g.u.shape        
        dx2, dy2 = g.dx**2, g.dy**2
        dnr_inv = 0.5/(dx2 + dy2)
        u = g.u

        err = 0.0
        for i in range(1, nx-1):
            for j in range(1, ny-1):
                tmp = u[i,j]
                u[i,j] = ((u[i-1, j] + u[i+1, j])*dy2 +
                          (u[i, j-1] + u[i, j+1])*dx2)*dnr_inv
                diff = u[i,j] - tmp
                err += diff*diff

        return sqrt(err)
        
    def solve(self, n_iter=0, eps=1.0e-16):        
        err = self.timeStep()
        count = 1

        while err > eps:
            if n_iter and count >= n_iter:
                return err
            err = self.timeStep()
            count = count + 1

        return count


    The code is pretty simple and very easy to write but if you run it
    for any sizeable problem (say a 500 x 500 grid of points).  You'll
    find that it takes forever to run.  The CPU hog in this case
    will be the timeStep method.  In the next section we will speed it
    up using NumPy.

  Using NumPy

    It turns out that the innermost loop of the LaplaceSolver.timeStep
    method can be readily expressed by a much simpler NumPy
    expression.  Here is a re-written timeStep method::


    def timeStep(self, dt=0.0):
        """Takes a time step using a numeric expressions."""
        g = self.grid
        dx2, dy2 = g.dx**2, g.dy**2
        dnr_inv = 0.5/(dx2 + dy2)
        u = g.u
        g.old_u = u.copy() # needed to compute the error.

        # The actual iteration
        u[1:-1, 1:-1] = ((u[0:-2, 1:-1] + u[2:, 1:-1])*dy2 +
                         (u[1:-1,0:-2] + u[1:-1, 2:])*dx2)*dnr_inv
        
        return g.computeError()


    The entire for i and j loops have been replaced by a single
    Numeric expression.  Numeric expressions operate element wise and
    hence the above expression works.  It basically computes the four
    point average.  If you have gone through the numeric tutorial and
    played with Numeric a bit you should be able to understand how
    this works.  The beauty of the expression is that its completely
    done in C.  This makes the computation much faster.  For a quick
    comparison here are some numbers for a single iteration on a
    500x500 grid.  On a PIII 450Mhz with 192 MB RAM, the above takes
    about 0.35 seconds whereas the slower one takes a whopping 18.7
    seconds.  This is already a 53 fold speed increase.  You will also
    note a few things.

      (1) We cannot compute the error the way we did earlier inside
      the for loop.  We need to make a copy of the data and then use
      the computeError function to do this.  This costs us memory and
      is not very pretty.  This is certainly a limitation but is worth
      a 53 fold speed increase.

      (2) The expression will use temporaries hence during one
      iteration, the computed values at a previous index will not be
      used for the next one.  For instance in the original for loop
      once the value of u[1,1] say is computed, the next value for
      u[1,2] will use the newly computed u[1,1] and not the old one.
      However in the numeric expression since it uses temporaries
      internally only the old value of u[1,1] will be used.  This is
      not a serious issue in this case because it is known that even
      when this happens the algorithm will converge.

   Apart from these two issues its clear that using Numeric boosts
   speed tremendously.  We will now use the amazing scipy.weave to
   speed this up further.


 Using scipy.weave.blitz

   The Numeric expression can be speeded up quite a bit if we use
   weave.blitz.  Here is the new function::

# import necessary modules and functions
import scipy.weave

    def timeStep(self, dt=0.0):
        """Takes a time step using a numeric expression that has been
        blitzed using weave."""        
        g = self.grid
        dx2, dy2 = g.dx**2, g.dy**2
        dnr_inv = 0.5/(dx2 + dy2)
        u = g.u
        g.old_u = u.copy()

        # The actual iteration
        expr = "u[1:-1, 1:-1] = ((u[0:-2, 1:-1] + u[2:, 1:-1])*dy2 + "\
               "(u[1:-1,0:-2] + u[1:-1, 2:])*dx2)*dnr_inv"
        scipy.weave.blitz(expr, check_size=0)

        return g.computeError()


    If you notice the only thing that has changed is that we put
    quotes around the original numeric expression and call this string
    'expr' and then invoke scipy.weave.blitz.  The check_size keyword
    when set to 1 does a few sanity checks and is to be used where you
    are debugging your code.  However for pure speed its is wise to
    set it to 0.  This time when we time the code for a 500x500 array
    for a single iteration it takes only about 0.12 seconds which is
    about a three fold increase!  There are again a few things to
    note.

        (1) The first time round the function call will take a long
        while to do some magic behind your back.  Next time round the
        code will run immediately.  More details on this are is in the
        weave documentation.

        (2) Again we need to use a temporary array to compute the
        error.

        (3) blitz does not use temporaries for the computation and
        therefore behaves more like the original (slow) for loop in
        that the computed values are re-used immediately.

    Apart from these two the results are identical as compared to the
    original for loop.  Its only about 170 times faster than the
    original code!  We will now look at yet another way to speed up
    our original for loop.  Enter scipy.weave.inline!

  Using scipy.weave.inline


    Inline allows one to embed C code directly into your Python code.
    So here is a simple version of an inlined version of the code::

    def inlineTimeStep(self, dt=0.0):        
        """Takes a time step using inlined C code -- this version uses
        blitz arrays."""        
        g = self.grid
        nx, ny = g.u.shape
        dx2, dy2 = g.dx**2, g.dy**2
        dnr_inv = 0.5/(dx2 + dy2)
        u = g.u
        
        code = """
               #line 120 "laplace.py" (This is only useful for debugging)
               double tmp, err, diff;
               err = 0.0;
               for (int i=1; i<nx-1; ++i) {
                   for (int j=1; j<ny-1; ++j) {
                       tmp = u(i,j);
                       u(i,j) = ((u(i-1,j) + u(i+1,j))*dy2 +
                                 (u(i,j-1) + u(i,j+1))*dx2)*dnr_inv;
                       diff = u(i,j) - tmp;
                       err += diff*diff;
                   }
               }
               return_val = Py::new_reference_to(Py::Float(sqrt(err)));
               """
        # compiler keyword only needed on windows with MSVC installed
        err = scipy.weave.inline(code,
                                 ['u', 'dx2', 'dy2', 'dnr_inv', 'nx','ny'],
                                 type_factories = blitz_type_factories,
                                 compiler = 'gcc',
                                 extra_compile_args =
                                 ['-O3','-malign-double','-funroll-loops'])
        return err

      
    The code itself looks very straightforward (which is what makes
    inline so cool).  The inline function call arguments are all self
    explanatory.  The line with '#line 120 ...' is only used for
    debugging and doesn't affect the speed in anyway.  Again the first
    time you run this function it takes a long while to do something
    and then next time round it blazes away.  This time notice that we
    have far more flexibility in our loop and can easily compute an
    error term without an need for temporary arrays.  Timing this
    version results in a time for a 500x500 array of a mere 0.04
    seconds per iteration!  This corresponds to a whopping 430 fold
    speed increase over the plain old for loop.  And remember we
    haven't sacrificed any of Pythons incredible flexibility!  This
    loop contains code that looks very nice but if we want to we can
    speed things up further by writing a little dirty code.  We won't
    get into that here but it suffices to say that its possible to get
    a further speed up of about 50% by using a different version.

    Finally we look at how its possible to easily implement the loop
    inside Fortran and call it from Python by using f2py.

  Using f2py

    f2py is an amazing utility that lets you easily call Fortran
    functions from Python.  First we will write a small Fortran
    subroutine to do our calculation.  Here is the code::

c File flaplace.f
      subroutine timestep(u,n,m,dx,dy,error)
      double precision u(n,m)
      double precision dx,dy,dx2,dy2,dnr_inv,tmp,diff
      integer n,m,i,j
cf2py intent(in) :: dx,dy
cf2py intent(in,out) :: u
cf2py intent(out) :: error
cf2py intent(hide) :: n,m
      dx2 = dx*dx
      dy2 = dy*dy
      dnr_inv = 0.5d0 / (dx2+dy2)
      error = 0
      do 200,i=2,n-1
         do 100,j=2,m-1
            tmp = u(i,j)
            u(i,j) = ((u(i-1,j) + u(i+1,j))*dy2+
     &           (u(i,j-1) + u(i,j+1))*dx2)*dnr_inv
            diff = u(i,j) - tmp
            error = error + diff*diff
 100     continue
 200  continue
      error = sqrt(error)
      end


    The lines starting with cf2py are special f2py directives and are
    documented in f2py.  The rest of the code is straightforward for
    those who know some Fortran.  We trivially create a Python module
    for this using the following command::

      % f2py -c flaplace.f -m flaplace

    You need the latest version of f2py to do this.  Here is how the
    Python side of things looks::

import flaplace

    def fortranTimeStep(self, dt=0.0):
        """Takes a time step using a simple fortran module that
        implements the loop in Fortran.  """        
        g = self.grid
        u, err = flaplace.timestep(g.u, g.dx, g.dy)
        return err


    Thats it!  Hopefully someday scipy.weave will let us do this
    inline and not require us to write a separate Fortran file.
    Anyway, using this module it takes about 0.03 seconds for a
    500x500 grid per iteration.  This is about a 550 fold speed
    increase over the original code and about 40% faster than the
    slower version of the inlined C.  The faster inlined C code is
    about 10% faster than the Fortran version though.

  Plain old C++

   Finally we implemented this in simple C++ (nothing fancy) without
   any Python just for comparison.  One should expect that C will be
   faster but the surprising thing is that its not by much!  Given the
   fact that its so easy to develop with Python this speed reduction
   really is not much of an issue.  Here are some timing results for a
   500x500 grid for 100 iterations.  

   
Type of code
Time taken (sec)
Slow python estimate  
1897.0
Numeric
29.0
Blitz
10.2
Inline
4.3
Fast Inline
2.9
Python/Fortran
3.2
Pure C++
2.4

 
   This is pretty amazing considering the flexibility and power of
   Python.  

The file laplace.py incorporates all of these methods so you can test
these out for yourselves.  Also included are the Fortran code and the
pure C++ code that you can compare with.