[Python-Dev] Support for Linux perf

Mon Nov 17 23:09:57 CET 2014

Hi,

The PEP-418 is about performance counters, but there is no mention of
performance management unit (PMU) counters, such as cache misses and
instruction counts.

The Linux perf tool aims at recording these samples at the system level. I
ran linux perf on CPython for profiling. The resulting callstack is inside
libpython.so, mostly recursive calls to PyEval_EvalFrameEx(), because the
tool works at the ELF level. Here is an example with a dummy program
(linux-tools on Ubuntu 14.04):

$ perf record python crunch.py
$ perf report --stdio
# Overhead  Command       Shared Object                            Symbol
# ........  .......  ..................  ................................
#
    32.37%   python  python2.7           [.] PyEval_EvalFrameEx
    13.70%   python  libm-2.19.so        [.] __sin_avx
     5.25%   python  python2.7           [.] binary_op1.5010
     4.82%   python  python2.7           [.] PyObject_GetAttr

While this may be insightful for the interpreter developers, it it not so
for the average Python developer. The report should display Python code
instead. It seems obvious, still I haven't found the feature for that.

When a performance counter reaches a given value, a sample is recorded. The
most basic sample only records a timestamps, thread ID and the program
counter (%rip). In addition, all executable memory maps of libraries are
recorded. For the callstack, frame pointers are traversed, but most of the
time, they are optimized on x86, so there is a fall back to unwind, which
requires saving register values and a chunk of the stack. The memory space
of the process is reconstructed offline.

CPython seems to allocates code and frames on mmap() pages. If the data is
outside about 1k from the top of stack, it is not available offline in the
trace. We need some way to reconstitute this memory space of the
interpreter to resolve the symbols, probably by  dumping the data on disk.

In Java, there is a small HotSpot agent that spits out the symbols of JIT
code:

https://github.com/jrudolph/perf-map-agent

The problem is that CPython does not JIT code, and executed code is the ELF
library itself. The executed frames are parameters of functions of the
interpreter. I don't think the same approach can be used (maybe this can be
applied to PyPy?).

I looked at how Python frames are handled in GDB
(file cpython/Tools/gdb/libpython.py). A python frame is detected in
Frame(gdbframe).is_evalframeex() by a C call to PyEval_EvalFrameEx().
However, the traceback accesses PyFrameObject on the heap (at least for
f->f_back = 0xa57460), which is possible in GDB when the program is paused
and the whole memory space is available, but is not recorded for offline
use in perf. Here is an example of callstack from GDB:

#0  PyEval_EvalFrameEx (f=Frame 0x7ffff7f1b060, for file crunch.py, line 7,
in bar (num=466829),
    throwflag=0) at ../Python/ceval.c:1039
#1  0x0000000000527877 in fast_function (func=<function at remote
0x7ffff6ec45a0>,
    pp_stack=0x7fffffffd280, n=1, na=1, nk=0) at ../Python/ceval.c:4106
#2  0x0000000000527582 in call_function (pp_stack=0x7fffffffd280, oparg=1)
at ../Python/ceval.c:4041

We could add a kernel module that "knows" how to make samples of CPython,
but it means python structures becomes sort of ABI, and kernel devs won't
allow a python interpreter in kernel mode ;-).

What we really want is f_code data and related objects:

(gdb) print (void *)(f->f_code)
$8 = (void *) 0x7ffff7e370f0

Maybe we could save these pages every time some code is loaded from the
interpreter? (the memory range is about 1.7MB, but )

Anyway, I think we must change CPython to support tools such as perf. Any
thoughts?

Cheers,

Francis
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20141117/4a63989a/attachment.html>