[Numpy-discussion] Profiling (was GSoC : Performance parity between numpy arrays and Python scalars)

David Cournapeau cournape at gmail.com
Sun May 5 17:57:45 EDT 2013

On Thu, May 2, 2013 at 2:58 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Thu, May 2, 2013 at 9:25 AM, David Cournapeau <cournape at gmail.com> wrote:
>>> * Re: the profiling, I wrote a full oprofile->callgrind format script
>>> years ago: http://vorpus.org/~njs/op2calltree.py
>>> Haven't used it in years either but neither oprofile nor kcachegrind
>>> are terribly fast-moving projects so it's probably still working, or
>>> could be made so without much work.
>>> Or easier is to use the gperftools CPU profiler:
>>> https://gperftools.googlecode.com/svn/trunk/doc/cpuprofile.html
>> I don't have experience with gperftools, but on recent linux kernels,
>> you can also use perf, which can't be made easier to use (no runtime
>> support needed), but you need a 'recent' kernel
>> http://indico.cern.ch/getFile.py/access?contribId=20&sessionId=4&resId=0&materialId=slides&confId=141309
>> I am hoping to talk a bit about those for our diving into numpy c code
>> tutorial in June, what's the + of gperf in your opinion ?
> For what I've used profiling for, THE key feature is proper callgraph
> support ("show me the *total* time spent in each function, including
> callees"). Otherwise, silly example, let's say you have a bug where
> you wrote:
> func1() {
>   for (i = 0; i < 10000000; i++)
>      foo = add(foo, bar[0])
> }
> Obviously this is a waste of time, since you're actually performing
> the same operation over and over. Many profilers, given this, will
> tell you that all the time is spent in 'add', which is useless,
> because you don't want to speed up 'add', you want to speed up 'func1'
> (probably by not calling 'add' so many times!). If you have relatively
> flat code like most kernel code this isn't an issue, but I generally
> don't...
> perf is a fabulous framework and doesn't have any way to get full
> callgraph information out so IME it's been useless. They have
> reporting modes that claim to (like some "fractal" thing?) but AFAI
> been able to tell from docs/googling/mailing lists, there is nobody
> who understands how to interpret this output except the people who
> wrote it. Really a shame that it falls down in the last mile like
> that, hopefully they will fix this soon.

Perf doc is written for Vulcan, but it does what I think you want, say:

void work(int n) {
  volatile int i=0; //don't optimize away
  while(i++ < n);
void easy() { work(1000 * 1000 * 50); }
void hard() { work(1000*1000*1000); }
int main() { easy(); hard(); }

compile with gcc -g -O0, and then:

perf record -g -a -- ./a.out
perf report -g -a --stdio

gives me

  95.22%            a.out  a.out
  [.] work
                      --- work
                         |--89.84%-- hard
                         |          main
                         |          __libc_start_main
                          --5.38%-- easy

or maybe even better with the -G option

 95.22%            a.out  a.out
 [.] work
                      --- __libc_start_main
                         |--94.35%-- hard
                         |          work
                          --5.65%-- easy

I agree it would be better with kcachgrind-like GUI, but it looks like
it is coming. I don't really understand the ncurses frontend to the
call graph.

(example taken/adapted from the great Yosefk blog:


More information about the NumPy-Discussion mailing list