[Numpy-discussion] Profiling (was GSoC : Performance parity between numpy arrays and Python scalars)

Nathaniel Smith njs at pobox.com
Thu May 2 12:15:36 EDT 2013


On Thu, May 2, 2013 at 10:51 AM, Francesc Alted <francesc at continuum.io> wrote:
> On 5/2/13 3:58 PM, Nathaniel Smith wrote:
>> callgrind has the *fabulous* kcachegrind front-end, but it only
>> measures memory access performance on a simulated machine. That is
>> very useful sometimes (e.g. if you're trying to optimize cache
>> locality), but there's no guarantee that the bottlenecks on its
>> simulated machine are the same as the bottlenecks on your real
>> machine.
>
> Agreed, there is no guarantee, but my experience is that kcachegrind
> normally gives you a pretty decent view of cache misses, and hence it
> can make pretty good predictions about how they affect your
> computations.  I have used this feature extensively for optimizing
> parts of the Blosc compressor, and I could not be happier (to the
> point that, if it were not for Valgrind, I could not have figured out
> many interesting memory access optimizations).

Right -- if you have code where you know that memory is the bottleneck
(so especially integer-heavy code), then callgrind is perfect. In fact
it was originally written to make it easier to optimize the bzip2
compressor :-). My point isn't that it's not useful, just that it's a
bit more of a specialist tool, so I hesitate to recommend it as the
first profiler people reach for. An extreme example: the last time I
played with this, I found that for a numpy scalar float64 * float64,
50% of the total time went to fiddling with floating point control
registers. But that time would be invisible to callgrind's
measurements...
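
That kind of time does show up in a sampling profiler that runs on the
real hardware, though. On Linux, a rough sketch with perf, using the
same scalar multiply as the workload (this assumes perf is installed
and, for readable C-level stacks, debug symbols for CPython/NumPy):

    # sample the real machine while the workload runs, with call graphs
    perf record -g python -m timeit \
        -s "import numpy as np; x = np.float64(1.5)" "x * x"

    # interactive per-function report of where the samples landed
    perf report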

-n


