A beginners guide to using Python for performance computing.
This is a simple introductory document to using Python for
performance computing. This should hopefully get you started.
We'll use Numeric, scipy.weave (blitz and inline), and also give an
example of how to use f2py to use a Fortran subroutine from within
Python.
We will also use this opportunity to benchmark the various ways to
solve this problem and will compare this to an implementation of the
algorithm in simple c++.
Problem description
The example we will consider is a very simple (read, trivial)
example of solving the 2D Laplace equation using an iterative
finite difference scheme (four point averaging, Gauss-Seidel
or
Gauss-Jordan). The formal specification of the problem
is as
follows. We are required to solve for some unknown
function u(x,y)
such that
\nabla^2 u = 0
with a boundary condition specified. For convenience
the domain
of interest is considered to be a rectangle and the boundary
values at the sides of this rectangle are given.
It can be shown that this problem can be solved using
a simple
four point averaging scheme as follows. Discretise
the domain
into an (nx x ny) grid of points. Then the function
u can be
represented as a 2 dimensional array - u(nx, ny).
The values of u
along the sides of the rectangle are given. The
solution can be
obtained by iterating in the following manner.
for i in range(1, nx-1):
for j
in range(1, ny-1):
u[i,j] = ((u[i-1, j] + u[i+1, j])*dy^2 +
(u[i, j-1] + u[i, j+1])*dx^2)/(2.0*(dx^2 + dy^2))
Where dx and dy are the lengths along the x and y axis
of the
discretised domain.
Numerical Solution
Implementing a solver for this is straight forward in
Pure Python
using simple a simple Numeric array to store the solution
matrix
u. The following code demonstrates a simple solver::
#!/usr/bin/env python
from Numeric import *
from math import *
class Grid:
"""A simple grid class that stores the details and solution
of the
computational grid."""
def __init__(self, nx=10, ny=10, xmin=0.0, xmax=1.0,
ymin=0.0, ymax=1.0):
self.xmin, self.xmax, self.ymin,
self.ymax = xmin, xmax, ymin, ymax
self.dx = float(xmax-xmin)/(nx-1)
self.dy = float(ymax-ymin)/(ny-1)
self.u = zeros((nx, ny), 'd')
self.old_u = self.u.copy() # used
to compute the change in solution
def setBCFunc(self, func):
"""Sets the Boundary condition
given a function of two variables."""
xmin, ymin = self.xmin, self.ymin
xmax, ymax = self.xmax, self.ymax
x = arange(xmin, xmax + self.dx*0.5,
self.dx)
y = arange(ymin, ymax + self.dy*0.5,
self.dy)
self.u[0 ,:] = func(xmin,y)
self.u[-1,:] = func(xmax,y)
self.u[:, 0] = func(x,ymin)
self.u[:,-1] = func(x,ymax)
def computeError(self):
"""Computes absolute error using
an L2 norm for the solution.
This requires that self.u and
self.old_u must be appropriately
setup."""
v = (self.u - self.old_u).flat
return sqrt(dot(v,v))
class LaplaceSolver:
"""A simple Laplacian solver that can use different schemes
to
solve the problem."""
def __init__(self, grid, stepper='numeric'):
self.grid = grid
def timeStep(self, dt=0.0):
"""Takes a time step using straight
forward Python loops."""
g = self.grid
nx, ny = g.u.shape
dx2, dy2 = g.dx**2, g.dy**2
dnr_inv = 0.5/(dx2 + dy2)
u = g.u
err = 0.0
for i in range(1, nx-1):
for j
in range(1, ny-1):
tmp = u[i,j]
u[i,j] = ((u[i-1, j] + u[i+1, j])*dy2 +
(u[i, j-1] + u[i, j+1])*dx2)*dnr_inv
diff = u[i,j] - tmp
err += diff*diff
return sqrt(err)
def solve(self, n_iter=0, eps=1.0e-16):
err = self.timeStep()
count = 1
while err > eps:
if n_iter
and count >= n_iter:
return err
err =
self.timeStep()
count
= count + 1
return count
The code is pretty simple and very easy to write but if
you run it
for any sizeable problem (say a 500 x 500 grid of points).
You'll
find that it takes forever to run. The CPU
hog in this case
will be the timeStep method. In the next section
we will speed it
up using NumPy.
Using NumPy
It turns out that the innermost loop of the LaplaceSolver.timeStep
method can be readily expressed by a much simpler NumPy
expression. Here is a re-written timeStep method::
def timeStep(self, dt=0.0):
"""Takes a time step using a numeric
expressions."""
g = self.grid
dx2, dy2 = g.dx**2, g.dy**2
dnr_inv = 0.5/(dx2 + dy2)
u = g.u
g.old_u = u.copy() # needed to
compute the error.
# The actual iteration
u[1:-1, 1:-1] = ((u[0:-2, 1:-1]
+ u[2:, 1:-1])*dy2 +
(u[1:-1,0:-2] + u[1:-1, 2:])*dx2)*dnr_inv
return g.computeError()
The entire for i and j loops have been replaced by a single
Numeric expression. Numeric expressions operate
element wise and
hence the above expression works. It basically computes
the four
point average. If you have gone through the numeric
tutorial and
played with Numeric a bit you should be able to understand
how
this works. The beauty of the expression is that
its completely
done in C. This makes the computation much
faster. For a quick
comparison here are some numbers for a single iteration
on a
500x500 grid. On a PIII 450Mhz with 192 MB RAM,
the above takes
about 0.35 seconds whereas the slower one takes a whopping
18.7
seconds. This is already a 53 fold speed increase.
You will also
note a few things.
(1) We cannot compute the error the way we
did earlier inside
the for loop. We need to make a copy
of the data and then use
the computeError function to do this.
This costs us memory and
is not very pretty. This is certainly
a limitation but is worth
a 53 fold speed increase.
(2) The expression will use temporaries hence
during one
iteration, the computed values at a previous
index will not be
used for the next one. For instance
in the original for loop
once the value of u[1,1] say is computed,
the next value for
u[1,2] will use the newly computed u[1,1]
and not the old one.
However in the numeric expression since it
uses temporaries
internally only the old value of u[1,1] will
be used. This is
not a serious issue in this case because it
is known that even
when this happens the algorithm will converge.
Apart from these two issues its clear that using Numeric boosts
speed tremendously. We will now use the amazing scipy.weave
to
speed this up further.
Using scipy.weave.blitz
The Numeric expression can be speeded up quite a bit if we use
weave.blitz. Here is the new function::
# import necessary modules and functions
import scipy.weave
def timeStep(self, dt=0.0):
"""Takes a time step using a numeric
expression that has been
blitzed using weave."""
g = self.grid
dx2, dy2 = g.dx**2, g.dy**2
dnr_inv = 0.5/(dx2 + dy2)
u = g.u
g.old_u = u.copy()
# The actual iteration
expr = "u[1:-1, 1:-1] = ((u[0:-2,
1:-1] + u[2:, 1:-1])*dy2 + "\
"(u[1:-1,0:-2] + u[1:-1, 2:])*dx2)*dnr_inv"
scipy.weave.blitz(expr, check_size=0)
return g.computeError()
If you notice the only thing that has changed is that
we put
quotes around the original numeric expression and call
this string
'expr' and then invoke scipy.weave.blitz. The check_size
keyword
when set to 1 does a few sanity checks and is to be used
where you
are debugging your code. However for pure speed
its is wise to
set it to 0. This time when we time the code for
a 500x500 array
for a single iteration it takes only about 0.12 seconds
which is
about a three fold increase! There are again a few
things to
note.
(1) The first time round the function
call will take a long
while to do some magic behind
your back. Next time round the
code will run immediately.
More details on this are is in the
weave documentation.
(2) Again we need to use a temporary
array to compute the
error.
(3) blitz does not use
temporaries for the computation and
therefore behaves more like the
original (slow) for loop in
that the computed values are re-used
immediately.
Apart from these two the results are identical as compared
to the
original for loop. Its only about 170 times faster
than the
original code! We will now look at yet another way
to speed up
our original for loop. Enter scipy.weave.inline!
Using scipy.weave.inline
Inline allows one to embed C code directly into your Python
code.
So here is a simple version of an inlined version of the
code::
def inlineTimeStep(self, dt=0.0):
"""Takes a time step using inlined
C code -- this version uses
blitz arrays."""
g = self.grid
nx, ny = g.u.shape
dx2, dy2 = g.dx**2, g.dy**2
dnr_inv = 0.5/(dx2 + dy2)
u = g.u
code = """
#line 120 "laplace.py" (This is only useful for debugging)
double tmp, err, diff;
err = 0.0;
for (int i=1; i<nx-1; ++i) {
for (int j=1; j<ny-1; ++j) {
tmp = u(i,j);
u(i,j) = ((u(i-1,j) + u(i+1,j))*dy2 +
(u(i,j-1) + u(i,j+1))*dx2)*dnr_inv;
diff = u(i,j) - tmp;
err += diff*diff;
}
}
return_val = Py::new_reference_to(Py::Float(sqrt(err)));
"""
# compiler keyword only needed
on windows with MSVC installed
err = scipy.weave.inline(code,
['u', 'dx2', 'dy2', 'dnr_inv', 'nx','ny'],
type_factories = blitz_type_factories,
compiler = 'gcc',
extra_compile_args =
['-O3','-malign-double','-funroll-loops'])
return err
The code itself looks very straightforward (which is what
makes
inline so cool). The inline function call arguments
are all self
explanatory. The line with '#line 120 ...' is only
used for
debugging and doesn't affect the speed in anyway.
Again the first
time you run this function it takes a long while to do
something
and then next time round it blazes away. This time
notice that we
have far more flexibility in our loop and can easily compute
an
error term without an need for temporary arrays.
Timing this
version results in a time for a 500x500 array of a mere
0.04
seconds per iteration! This corresponds to a whopping
430 fold
speed increase over the plain old for loop. And
remember we
haven't sacrificed any of Pythons incredible flexibility!
This
loop contains code that looks very nice but if we want
to we can
speed things up further by writing a little dirty code.
We won't
get into that here but it suffices to say that its possible
to get
a further speed up of about 50% by using a different version.
Finally we look at how its possible to easily implement
the loop
inside Fortran and call it from Python by using f2py.
Using f2py
f2py is an amazing utility that lets you easily call Fortran
functions from Python. First we will write a small
Fortran
subroutine to do our calculation. Here is the code::
c File flaplace.f
subroutine timestep(u,n,m,dx,dy,error)
double precision u(n,m)
double precision dx,dy,dx2,dy2,dnr_inv,tmp,diff
integer n,m,i,j
cf2py intent(in) :: dx,dy
cf2py intent(in,out) :: u
cf2py intent(out) :: error
cf2py intent(hide) :: n,m
dx2 = dx*dx
dy2 = dy*dy
dnr_inv = 0.5d0 / (dx2+dy2)
error = 0
do 200,i=2,n-1
do 100,j=2,m-1
tmp =
u(i,j)
u(i,j)
= ((u(i-1,j) + u(i+1,j))*dy2+
&
(u(i,j-1) + u(i,j+1))*dx2)*dnr_inv
diff =
u(i,j) - tmp
error
= error + diff*diff
100 continue
200 continue
error = sqrt(error)
end
The lines starting with cf2py are special f2py directives
and are
documented in f2py. The rest of the code is straightforward
for
those who know some Fortran. We trivially create
a Python module
for this using the following command::
% f2py -c flaplace.f -m flaplace
You need the latest version of f2py to do this.
Here is how the
Python side of things looks::
import flaplace
def fortranTimeStep(self, dt=0.0):
"""Takes a time step using a simple
fortran module that
implements the loop in Fortran.
"""
g = self.grid
u, err = flaplace.timestep(g.u,
g.dx, g.dy)
return err
Thats it! Hopefully someday scipy.weave will let
us do this
inline and not require us to write a separate Fortran
file.
Anyway, using this module it takes about 0.03 seconds
for a
500x500 grid per iteration. This is about a 550
fold speed
increase over the original code and about 40% faster than
the
slower version of the inlined C. The faster inlined
C code is
about 10% faster than the Fortran version though.
Plain old C++
Finally we implemented this in simple C++ (nothing fancy) without
any Python just for comparison. One should expect that
C will be
faster but the surprising thing is that its not by much!
Given the
fact that its so easy to develop with Python this speed reduction
really is not much of an issue. Here are some timing results
for a
500x500 grid for 100 iterations.
Type of code
|
Time taken (sec)
|
Slow python estimate
|
1897.0
|
Numeric
|
29.0
|
Blitz
|
10.2
|
Inline
|
4.3
|
Fast Inline
|
2.9
|
Python/Fortran
|
3.2
|
Pure C++
|
2.4
|
This is pretty amazing considering the flexibility and power
of
Python.
The file laplace.py incorporates all of these methods so you can test
these out for yourselves. Also included are the Fortran code and the
pure C++ code that you can compare with.