Following up on the discussion of small-array performance, I decided
to profile Numeric and numarray. I've been playing with Pyrex,
implementing a vector class (for doubles) that runs as fast as I can
make it, and it's beating Numeric and numarray by a good bit, so I
figured those two were doing something suboptimal. I can't say much about
numarray right now (most of the runtime is in complicated Python code,
with weird callbacks to and from C code), but for Numeric it was easy
to find a problem.
First, profiling Python extensions is not easy. I've played with using
the -pg flag to the GCC C compiler, but I couldn't find a way to
profile the extension itself (even with Python built with -pg). So I
see about three ways to profile an extension:
1) Wrap each function in the extension in a pair of calls to something
that keeps track of time in the function. This is what the
numarray.teacup module does. This is unsatisfying, intrusive,
labour-intensive, and it does manually what -pg does automatically.
2) Compile the extension into the Python executable (where both the
extension and Python have been compiled with the -pg flag).
Unfortunately, as far as I can tell, this is not possible with
distutils. If you have another build framework, however, it's not
that hard to do. I've had some success with this approach with
other extensions.
3) Use oprofile (http://oprofile.sourceforge.net/), which runs on
Linux on a x86 processor. This is the approach that I've used here.
oprofile is a combination of a kernel module for Linux, a daemon
for collecting sample data, and several tools to analyse the
samples. It periodically polls the processor performance counters,
and records which code is running. It's a system-level profiler: it
profiles _everything_ that's running on the system. One obstacle is
that it does require root access.
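Approach (1) can be sketched in pure Python with a decorator that
accumulates per-function wall-clock time. This is my own simplified
illustration of the idea, not what numarray.teacup actually does:

```python
import time
from collections import defaultdict

totals = defaultdict(float)  # function name -> accumulated seconds

def timed(func):
    """Wrap func so the time spent inside it is recorded in totals."""
    def wrapper(*args, **kwargs):
        t0 = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            totals[func.__name__] += time.time() - t0
    return wrapper

@timed
def work():
    return sum(range(100000))

for _ in range(10):
    work()
print("work: %.6f s total" % totals["work"])
```

The obvious drawbacks show up immediately: every function must be
wrapped by hand, and the wrapper itself adds overhead to each call.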
Short tutorial on using oprofile
--------------------------------
Using oprofile on Debian with a 2.6.x kernel is easy
(replace sudo with your favourite "be-root" method):
$ sudo apt-get install oprofile
$ sudo modprobe oprofile # make sure the kernel module is installed
Now, start the oprofile daemon. On my machine, --event=default just
looks at retired CPU instructions. Read the oprofile documentation for
more info.
$ sudo opcontrol --event=default --start
$ (run code)
$ opcontrol --dump # dump the statistics to disk
# this is the only thing a non-root user can do
$ sudo opcontrol --stop # we don't need the daemon anymore
To do another profile run, you need to reset the data
$ sudo opcontrol --reset
You should be able to do the above while the daemon is running, but
the daemon crashes on me when I do that; I find I end up having to
also clear the old statistics manually:
$ sudo rm -rf /var/lib/oprofile/samples
Once you've collected samples, you can analyse the results. Here, I'll
be looking at adding two 1000-element arrays with the following code:
import Numeric as NA
a = NA.arange(1000.0)
b = NA.arange(1000.0)
for i in xrange(10000000):
    a + b
This takes 1m14s on my machine (an AMD64 3200+ running in 64-bit mode).
So, where I have (run code) above, I'd do
$ python x.py
Once I've collected the samples, I can analyse them. Note that samples
are collected on a per-application basis; if you've got other
processes using python, they'll be included. You could copy the python
binary to another location and run your script with that; then your
program would be the only one picked up by the following analysis.
$ opstack -t 1 /usr/bin/python
self % child % image name symbol name
132281 10.5031 0 0 python (no symbols)
-------------------------------------------------------------------------------
704810 55.9618 0 0 _numpy.so check_array
-------------------------------------------------------------------------------
309384 24.5650 0 0 umath.so DOUBLE_add
-------------------------------------------------------------------------------
112974 8.9701 0 0 libc-2.3.2.so (no symbols)
-------------------------------------------------------------------------------
The -t 1 limits the display to those routines taking more than 1% of
the runtime. 10% for python and 9% for the C library probably aren't
so bad (I'm thinking that's calls to malloc() and friends). The big
problem, however, is that only 25% of the time is actually spent doing
useful work. What's check_array doing? We can delve deeper:
$ mkdir profile-Numeric
$ opannotate --source -o profile-Numeric \
--search-dir=<path to Numeric source> /usr/bin/python
Now profile-Numeric/<path to Numeric source>/Src has annotated copies
of the source for the Numeric extension modules. The definition of
check_array is in ufuncobject.c, which gives us
386 0.0286 :void check_array(PyArrayObject *ap) { /* check_array total: 704810 */
: double *data;
: int i, n;
:
371 0.0275 : if (ap->descr->type_num == PyArray_DOUBLE || ap->descr->type_num == PyArray_CDOUBLE) {
89 0.0066 : data = (double *)ap->data;
758 0.0563 : n = PyArray_Size((PyObject *)ap);
46 0.0034 : if (ap->descr->type_num == PyArray_CDOUBLE) n *= 2;
:
700662 51.9988 :        for(i=0; i<n; i++) ...
               :    }
               :}
So it's the inner loop, checking each element for overflow, that eats
all the time. The check (when HAVE_FINITE is not defined) is
essentially -HUGE_VAL <= data[i] <= HUGE_VAL, which won't catch
overflows to inf:
>>> import Numeric
>>> a = Numeric.array([1e308])
>>> a + a
array([ inf])
It will catch NaN's, though. It's obvious when you realize that
HUGE_VAL is inf; inf <= inf is true.
With HAVE_FINITE defined, I get, for the same a array:
>>> a + a
<traceback>
OverflowError: math range error
The Numeric documentation has this to say about the check_return
parameter to PyUFunc_FromFuncAndData (which determines whether
check_array is called):
Usually best to set to 1. If this is non-zero then returned
matrices will be cleaned up so that rank-0 arrays will be returned
as python scalars. Also, if non-zero, then any math error that
sets the errno global variable will cause an appropriate Python
exception to be raised.
Note that the rank-0 array -> scalar conversion happens regardless.
check_return doesn't affect this at all.
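The catch-NaN-but-not-inf behaviour is easy to reproduce in pure
Python; the range test that check_array effectively performs passes
for inf but fails for NaN:

```python
huge = float("inf")   # HUGE_VAL is inf on IEEE-754 systems
nan = float("nan")

def in_range(x):
    # the -HUGE_VAL <= x <= HUGE_VAL test discussed above
    return -huge <= x <= huge

print(in_range(1e308 + 1e308))  # True: overflow to inf slips through
print(in_range(nan))            # False: NaN fails every comparison
```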
Removing check_array
--------------------
Commenting out the body of check_array in ufuncobject.c speeds up the
script above by *two* times. On my iBook (a G4 800), it speeds it up
by *four* times.
Using timeit.py:
$ python /usr/lib/python2.3/timeit.py \
-s 'import Numeric as NA; N=1e4; a=NA.arange(N)/N; b=NA.arange(N)/N' \
'a+b'
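The same measurement can also be scripted with the timeit module
directly. This sketch uses plain Python lists as a stand-in for the
Numeric arrays, so it runs even without Numeric installed:

```python
import timeit

# stand-in for: import Numeric as NA; a = NA.arange(N)/N; b = ...
setup = "N = 10000; a = list(range(N)); b = list(range(N))"
stmt = "[x + y for x, y in zip(a, b)]"  # stand-in for: a + b

timer = timeit.Timer(stmt, setup=setup)
# best of 3 runs of 100 loops, reported in usec per loop like timeit.py
usec = min(timer.repeat(repeat=3, number=100)) / 100 * 1e6
print("%.1f usec per loop" % usec)
```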
I get for various values of N:
   N      Stock     Numeric       numarray   numarray
          Numeric   without       recent     1.1.1
          23.7      check_array   CVS
 1e1        1.13       1.08         10.5        9.9
 1e2        1.73       1.35         10.8       10.6
 1e3        6.91       3.2          13.3       12.9
 1e4       83.3       42.5          52.8       52.3
 1e5     4890       4420          4520        4510
 1e6    52700      47400         47100       47000
 1e7   532000     473000        476000      474000
Numeric is as fast as numarray now! :-) The 10x change in per-element
speed between 1e4 and 1e5 is due to cache effects.
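The jump can be read off the "Numeric without check_array" column:
dividing each time by N gives the per-element cost (numbers copied
from the table above):

```python
# per-loop times from the table, Numeric without check_array
times = {10**4: 42.5, 10**5: 4420.0}

per_element = dict((n, t / n) for n, t in times.items())
ratio = per_element[10**5] / per_element[10**4]
print("1e4: %.5f per element" % per_element[10**4])
print("1e5: %.5f per element" % per_element[10**5])
print("ratio: %.1fx" % ratio)  # roughly 10x once out of cache
```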
   N      Stock     Numeric       numarray   numarray
          Numeric   without       recent     1.1.1
          23.7      check_array   CVS
 1e1        1.31       1.28          8.49        7.64
 1e2        5.86       5.44         14.4        12.1
 1e3       51.8       48            70.4        54.5
 1e4      542        502           643         508
 1e5     7480       6880          7430        6850
 1e6    77500      70700         82700       69100
 1e7   775000     710000        860000      694000
Numeric is faster than numarray from CVS, but there seems to be a
regression in numarray there. Without check_array, Numeric is almost
as fast as numarray 1.1.1.
Remarks
-------
- I'd rather have my speed than checks for NaN's. Have that in a
separate function (I'm willing to write one), or do numarray-style
processor flag checks (tougher).
- General plea: *please*, *please*, when releasing a library for which
speed is a selling point, profile it first!
- Doing the same profiling on numarray shows 15% of the time actually
spent adding, 65% somewhere in Python, and 15% in libc.
- I'm still fiddling. Using the three-argument form of Numeric.add (so
no memory allocation needs to be done), 64% of the time is now spent
adding; I think that could be better. The Pyrex vector class I'm
working on does 80% adding (with memory allocation for the result).
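The three-argument pattern (the caller supplies the output array, so
nothing is allocated per call) can be sketched in pure Python with
array.array standing in for a Numeric vector; add3 here is my own
illustration, not Numeric's actual API:

```python
from array import array

def add3(a, b, out):
    # like the three-argument Numeric.add(a, b, out): write a[i] + b[i]
    # into a preallocated output buffer instead of allocating a new one
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out

a = array("d", range(5))
b = array("d", range(5))
out = array("d", bytes(8 * 5))  # allocated once, reused every call
add3(a, b, out)
print(list(out))  # [0.0, 2.0, 4.0, 6.0, 8.0]
```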
Hope this helps :-)
--
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke http://arbutus.physics.mcmaster.ca/dmc/
|cookedm@physics.mcmaster.ca