[Numpy-discussion] Profiling (was GSoC : Performance parity between numpy arrays and Python scalars)

Nathaniel Smith njs at pobox.com
Thu May 2 09:58:52 EDT 2013

On Thu, May 2, 2013 at 9:25 AM, David Cournapeau <cournape at gmail.com> wrote:
>> * Re: the profiling, I wrote a full oprofile->callgrind format script
>> years ago: http://vorpus.org/~njs/op2calltree.py
>> Haven't used it in years either but neither oprofile nor kcachegrind
>> are terribly fast-moving projects so it's probably still working, or
>> could be made so without much work.
>> Or easier is to use the gperftools CPU profiler:
>> https://gperftools.googlecode.com/svn/trunk/doc/cpuprofile.html
> I don't have experience with gperftools, but on recent linux kernels,
> you can also use perf, which can't be made easier to use (no runtime
> support needed), but you need a 'recent' kernel
> http://indico.cern.ch/getFile.py/access?contribId=20&sessionId=4&resId=0&materialId=slides&confId=141309
> I am hoping to talk a bit about those for our diving into numpy c code
> tutorial in June, what's the + of gperf in your opinion ?

For what I've used profiling for, THE key feature is proper callgraph
support ("show me the *total* time spent in each function, including
callees"). Otherwise, silly example, let's say you have a bug where
you wrote:

func1() {
  for (i = 0; i < 10000000; i++)
     foo = add(foo, bar[0]);
}

Obviously this is a waste of time, since you're actually performing
the same operation over and over. Many profilers, given this, will
tell you that all the time is spent in 'add', which is useless,
because you don't want to speed up 'add', you want to speed up 'func1'
(probably by not calling 'add' so many times!). If you have relatively
flat code like most kernel code this isn't an issue, but I generally don't.
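
To make the self-vs-total distinction concrete, here is a minimal sketch
in C. It is not taken from any real profiler: the struct layout, the
function name inclusive_samples, and the sample counts are all made up
for illustration. It just shows that a function's total cost is its own
cost plus the total cost of everything it calls.

```c
#include <assert.h>
#include <stddef.h>

/* One node of a hypothetical call tree: the samples attributed to the
   function itself ("self"), plus the callees it invoked. */
struct node {
    const char *name;
    long self_samples;
    const struct node *const *callees;  /* NULL-terminated array, or NULL */
};

/* Inclusive ("total") cost = self cost + inclusive cost of every callee. */
long inclusive_samples(const struct node *n)
{
    assert(n != NULL);
    long total = n->self_samples;
    if (n->callees != NULL) {
        for (size_t i = 0; n->callees[i] != NULL; i++)
            total += inclusive_samples(n->callees[i]);
    }
    return total;
}

/* Made-up profile of the buggy loop: nearly every sample lands in add,
   and only a few land in func1's own loop overhead. */
static const struct node add_node = { "add", 9900, NULL };
static const struct node *const func1_callees[] = { &add_node, NULL };
static const struct node func1_node = { "func1", 100, func1_callees };
```

With these invented numbers, a flat report would rank add first (9900
self samples against 100 for func1), while inclusive_samples(&func1_node)
charges func1 with everything that happened underneath it, which is the
number that points at the loop.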

perf is a fabulous framework but doesn't have any way to get full
callgraph information out, so IME it's been useless. They have
reporting modes that claim to (like some "fractal" thing?) but as far as
I've been able to tell from docs/googling/mailing lists, there is nobody
who understands how to interpret this output except the people who
wrote it. Really a shame that it falls down in the last mile like
that, hopefully they will fix this soon.

callgrind has the *fabulous* kcachegrind front-end, but it only
measures memory access performance on a simulated machine, which is
very useful sometimes (if you're trying to optimize cache locality),
but there's no guarantee that the bottlenecks on its simulated machine
are the same as the bottlenecks on your real machine.

oprofile is getting long in the tooth (superseded by perf), and its
built-in reporting tools are merely ok, but it does have full
callgraph information and with the script above you can get the output
into kcachegrind.

perftools don't have all the fancy features of the in-kernel options,
but they're trivial to use, and their reporting options are genuinely
useful (though not quite as awesome as kcachegrind). So while in
theory they're the least whizz-bang awesome of all of these options, in
practice I find them the most useful.

(Also, beware of terminology collision, "gperf" is something else again...)

