Following up on the discussion of small-array performance, I decided
to profile Numeric and numarray. I've been playing with Pyrex,
implementing a vector class (for doubles) that runs as fast as I can
make it, and it's beating Numeric and numarray by a good bit, so I
figured those two were doing something suboptimal. I can't say much about
numarray right now (most of the runtime is in complicated Python code,
with weird callbacks to and from C code), but for Numeric it was easy
to find a problem.
First, profiling Python extensions is not easy. I've played with using
the -pg flag to the GCC C compiler, but I couldn't find a way to
profile the extension itself (even with Python built with -pg). So I
see about three ways to profile an extension:
1) Wrap each function in the extension in a pair of calls to something
that keeps track of time in the function. This is what the
numarray.teacup module does. This is unsatisfying, intrusive,
labour-intensive, and it does manually what -pg does automatically.
2) Compile the extension into the Python executable (where both the
extension and Python have been compiled with the -pg flag).
Unfortunately, as far as I can tell, this is not possible with
distutils. If you have another build framework, however, it's not
that hard to do. I've had some success with this approach with
other extensions.
3) Use oprofile (http://oprofile.sourceforge.net/), which runs on
Linux on a x86 processor. This is the approach that I've used here.
oprofile is a combination of a kernel module for Linux, a daemon
for collecting sample data, and several tools to analyse the
samples. It periodically polls the processor performance counters,
and records which code is running. It's a system-level profiler: it
profiles _everything_ that's running on the system. One obstacle is
that it does require root access.
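Approach (1) can be sketched in pure Python with a decorator that
accumulates per-function wall-clock time. This is my own simplified
illustration of the idea, not what numarray.teacup actually does:

```python
import time
from collections import defaultdict

totals = defaultdict(float)  # function name -> accumulated seconds

def timed(func):
    """Wrap func so the time spent inside it is recorded in totals."""
    def wrapper(*args, **kwargs):
        t0 = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            totals[func.__name__] += time.time() - t0
    return wrapper

@timed
def work():
    return sum(range(100000))

for _ in range(10):
    work()
print("work: %.6f s total" % totals["work"])
```

The obvious drawbacks show up immediately: every function must be
wrapped by hand, and the wrapper itself adds overhead to each call.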
Short tutorial on using oprofile
--------------------------------
Using oprofile on Debian with a 2.6.x kernel is easy
(replace sudo with your favourite "be-root" method):
$ sudo apt-get install oprofile
$ sudo modprobe oprofile # make sure the kernel module is installed
Now, start the oprofile daemon. On my machine, --event=default just
looks at retired CPU instructions. Read the oprofile documentation for
more info.
$ sudo opcontrol --event=default --start
$ (run code)
$ opcontrol --dump # dump the statistics to disk
# this is the only thing a non-root user can do
$ sudo opcontrol --stop # we don't need the daemon anymore
To do another profile run, you need to reset the data
$ sudo opcontrol --reset
You should be able to do the above while the daemon is running, but
the daemon crashes on me when I do that; I find I end up having to
also clear the old statistics manually:
$ sudo rm -rf /var/lib/oprofile/samples
Once you've collected samples, you can analyse the results. Here, I'll
be looking at adding two 1000-element arrays with the following code:
import Numeric as NA
a = NA.arange(1000.0)
b = NA.arange(1000.0)
for i in xrange(10000000):
    a + b
This takes 1m14s on my machine (an AMD64 3200+ running in 64-bit mode).
So, where I have (run code) above, I'd do
$ python x.py
Once I've collected the samples, I can analyse them. Note that samples
are collected on a per-application basis; if you've got other
processes using python, they'll be included. You could copy the python
binary to another location and run your script with that; then your
program would be the only one picked up by the following analysis.
$ opstack -t 1 /usr/bin/python
self % child % image name symbol name
132281 10.5031 0 0 python (no symbols)
-------------------------------------------------------------------------------
704810 55.9618 0 0 _numpy.so check_array
-------------------------------------------------------------------------------
309384 24.5650 0 0 umath.so DOUBLE_add
-------------------------------------------------------------------------------
112974 8.9701 0 0 libc-2.3.2.so (no symbols)
-------------------------------------------------------------------------------
The -t 1 limits the display to those routines taking more than 1% of
the runtime. 10% for python and 9% for the C library probably aren't
so bad (I'm thinking that's calls to malloc() and friends). The big
problem, however, is that only 25% of the time is actually spent doing
useful work. What's check_array doing? We can delve deeper:
$ mkdir profile-Numeric
$ opannotate --source -o profile-Numeric \
--search-dir=<path to Numeric source> /usr/bin/python
Now profile-Numeric/<path to Numeric source>/Src has annotated copies
of the source for the Numeric extension modules. The definition of
check_array is in ufuncobject.c, which gives us
386 0.0286 :void check_array(PyArrayObject *ap) { /* check_array total: 704810 */
: double *data;
: int i, n;
:
371 0.0275 : if (ap->descr->type_num == PyArray_DOUBLE || ap->descr->type_num == PyArray_CDOUBLE) {
89 0.0066 : data = (double *)ap->data;
758 0.0563 : n = PyArray_Size((PyObject *)ap);
46 0.0034 : if (ap->descr->type_num == PyArray_CDOUBLE) n *= 2;
:
700662 51.9988 :        for(i=0; i<n; i++) ...
               :    }
               :}
So it's the inner loop, checking each element for overflow, that eats
all the time. The check (when HAVE_FINITE is not defined) is
essentially -HUGE_VAL <= data[i] <= HUGE_VAL, which won't catch
overflows to inf:
>>> import Numeric
>>> a = Numeric.array([1e308])
>>> a + a
array([ inf])
It will catch NaN's, though. It's obvious when you realize that
HUGE_VAL is inf; inf <= inf is true.
With HAVE_FINITE defined, I get, for the same a array:
>>> a + a
<traceback>
OverflowError: math range error
The Numeric documentation has this to say about the check_return
parameter to PyUFunc_FromFuncAndData (which determines whether
check_array is called):
Usually best to set to 1. If this is non-zero then returned
matrices will be cleaned up so that rank-0 arrays will be returned
as python scalars. Also, if non-zero, then any math error that
sets the errno global variable will cause an appropriate Python
exception to be raised.
Note that the rank-0 array -> scalar conversion happens regardless.
check_return doesn't affect this at all.
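The catch-NaN-but-not-inf behaviour is easy to reproduce in pure
Python; the range test that check_array effectively performs passes
for inf but fails for NaN:

```python
huge = float("inf")   # HUGE_VAL is inf on IEEE-754 systems
nan = float("nan")

def in_range(x):
    # the -HUGE_VAL <= x <= HUGE_VAL test discussed above
    return -huge <= x <= huge

print(in_range(1e308 + 1e308))  # True: overflow to inf slips through
print(in_range(nan))            # False: NaN fails every comparison
```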
Removing check_array
--------------------
Commenting out the body of check_array in ufuncobject.c speeds up the
script above by *two* times. On my iBook (a G4 800), it speeds it up
by *four* times.
Using timeit.py:
$ python /usr/lib/python2.3/timeit.py \
-s 'import Numeric as NA; N=1e4; a=NA.arange(N)/N; b=NA.arange(N)/N' \
'a+b'
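The same measurement can also be scripted with the timeit module
directly. This sketch uses plain Python lists as a stand-in for the
Numeric arrays, so it runs even without Numeric installed:

```python
import timeit

# stand-in for: import Numeric as NA; a = NA.arange(N)/N; b = ...
setup = "N = 10000; a = list(range(N)); b = list(range(N))"
stmt = "[x + y for x, y in zip(a, b)]"  # stand-in for: a + b

timer = timeit.Timer(stmt, setup=setup)
# best of 3 runs of 100 loops, reported in usec per loop like timeit.py
usec = min(timer.repeat(repeat=3, number=100)) / 100 * 1e6
print("%.1f usec per loop" % usec)
```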
I get for various values of N:
   N      Stock     Numeric       numarray   numarray
          Numeric   without       recent     1.1.1
          23.7      check_array   CVS
 1e1        1.13       1.08         10.5        9.9
 1e2        1.73       1.35         10.8       10.6
 1e3        6.91       3.2          13.3       12.9
 1e4       83.3       42.5          52.8       52.3
 1e5     4890       4420          4520        4510
 1e6    52700      47400         47100       47000
 1e7   532000     473000        476000      474000
Numeric is as fast as numarray now! :-) The 10x change in per-element
speed between 1e4 and 1e5 is due to cache effects.
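The jump can be read off the "Numeric without check_array" column:
dividing each time by N gives the per-element cost (numbers copied
from the table above):

```python
# per-loop times from the table, Numeric without check_array
times = {10**4: 42.5, 10**5: 4420.0}

per_element = dict((n, t / n) for n, t in times.items())
ratio = per_element[10**5] / per_element[10**4]
print("1e4: %.5f per element" % per_element[10**4])
print("1e5: %.5f per element" % per_element[10**5])
print("ratio: %.1fx" % ratio)  # roughly 10x once out of cache
```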
   N      Stock     Numeric       numarray   numarray
          Numeric   without       recent     1.1.1
          23.7      check_array   CVS
 1e1        1.31       1.28          8.49        7.64
 1e2        5.86       5.44         14.4        12.1
 1e3       51.8       48            70.4        54.5
 1e4      542        502           643         508
 1e5     7480       6880          7430        6850
 1e6    77500      70700         82700       69100
 1e7   775000     710000        860000      694000
Numeric is faster than numarray from CVS, but there seems to be a
regression in numarray there. Without check_array, Numeric is almost
as fast as numarray 1.1.1.
Remarks
-------
- I'd rather have my speed than checks for NaN's. Have that in a
separate function (I'm willing to write one), or do numarray-style
processor flag checks (tougher).
- General plea: *please*, *please*, when releasing a library for which
speed is a selling point, profile it first!
- Doing the same profiling on numarray shows 15% of the time actually
spent adding, 65% somewhere in Python, and 15% in libc.
- I'm still fiddling. Using the three-argument form of Numeric.add (so
no memory allocation needs to be done), 64% of the time is now spent
adding; I think that could be better. The Pyrex vector class I'm
working on does 80% adding (with memory allocation for the result).
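The three-argument pattern (the caller supplies the output array, so
nothing is allocated per call) can be sketched in pure Python with
array.array standing in for a Numeric vector; add3 here is my own
illustration, not Numeric's actual API:

```python
from array import array

def add3(a, b, out):
    # like the three-argument Numeric.add(a, b, out): write a[i] + b[i]
    # into a preallocated output buffer instead of allocating a new one
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out

a = array("d", range(5))
b = array("d", range(5))
out = array("d", bytes(8 * 5))  # allocated once, reused every call
add3(a, b, out)
print(list(out))  # [0.0, 2.0, 4.0, 6.0, 8.0]
```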
Hope this helps :-)
--
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke http://arbutus.physics.mcmaster.ca/dmc/
|cookedm@physics.mcmaster.ca