Sampling Profiler for Python

Hello world,

since I moved from Java development to Python development, I really miss the sampling profiler of jvisualvm (http://visualvm.java.net/). Perhaps it is only me, but especially for long-running GUI applications I would really like to know why they sometimes slow down to a crawl. cProfile is not really useful for this, as it alone makes the software unresponsive. It also gives meaningless results when multiple threads are involved (for example when doing I/O in the background with http://www.lag.net/paramiko/).

I looked for an existing profiler and found statprof, mentioned in this blog: http://wingolog.org/archives/2005/10/28/profiling However, it still fails to profile multiple threads. I don't think one can do any better without touching the interpreter core. So that's what I tried...

My approach builds the profiler from the following parts:

1. Sample trigger: a timer that triggers taking samples of all threads at fixed intervals.
2. Sampling function injection: ensures that a sampling function is called on each thread at the first opportunity after the sample trigger fired.
3. Sampling function: a function that assigns the current time slice to the code executing on the thread it is called on.

Only the second part has to be implemented in the Python interpreter core; the sample trigger and sampling function can be provided by external modules. My implementation of (2) can be found at https://bitbucket.org/bluehorn/sampling_prof and I also prepared a first implementation of (1) and (3) at https://bitbucket.org/bluehorn/profiler_util.

Using that library, I can profile pybench with a 3% slowdown (taking 100 profiling samples per second):

    $ export PYTHONPATH=$HOME/python-work/profiler_util
    $ python3 -m sampling_profiler -- Tools/pybench/pybench.py -c stock.pybench -f prof.pybench
    [...]
    Totals:    3119ms  3072ms   +1.5%  3198ms  3108ms   +2.9%

This also shows that the string benchmark takes the most time:

    Thread MainThread (3842 samples, 38.434401 seconds)
    ------------------------------------------------------------------------
    cpu_time  (cum)  ticks  (cum)  samples  (cum)  filename:lineno function
       0.548  0.548     55     55       55     55  Tools/pybench/Strings.py:455 test
       0.448  0.448     45     45       45     45  Tools/pybench/Strings.py:568 calibrate
       0.428  0.428     43     43       43     43  Tools/pybench/Constructs.py:484 test
       0.380  0.380     38     38       38     38  Tools/pybench/Calls.py:445 f
       0.324  0.324     34     34       34     34  Tools/pybench/Calls.py:122 f

The output columns are:

cpu_time: CPU time in seconds accounted to that line.
(cum): the same, but including the cumulative time of the functions called from that line. Gathering that data means the sampling function has to walk the stack for each sample, which causes a dramatic slowdown for deeply recursive calls; there should be a switch to disable this.
ticks: number of sampling profiler ticks accounted to that line. For my example (where sample ticks are equidistant in wall time) this is a measure of the time spent on that line, including any blocking calls (especially I/O).
(cum): ... including called functions.
samples: number of times the sampling profiler function was called for that line. This equals ticks unless executing that line took longer than one profiler tick.

The remaining output should be obvious.

I have to admit that this is far from finished, but I hope to get some feedback and perhaps even help by posting it here. What do you think?

Greetings, Torsten
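[Editorial note: for readers who want to experiment with the idea without building the patched interpreter, parts (1) and (3) can be roughly approximated in pure Python. The sketch below is not the code from sampling_prof or profiler_util, and all names in it are made up. Instead of injecting the sampling function into each thread (part 2, which is what the interpreter patch provides), it inspects all threads from the outside via sys._current_frames(), so it only counts equidistant wall-clock ticks rather than per-thread CPU time.]

    import collections
    import sys
    import threading
    import time

    # Rough stand-in for parts (1) and (3): a daemon thread wakes up at a fixed
    # interval, grabs the current frame of every thread and charges the sample
    # to the (filename, lineno, function) it finds. Walking f_back gives the
    # cumulative ("cum") counts described above.
    SAMPLE_INTERVAL = 0.01                     # 100 samples per second

    samples = collections.defaultdict(int)     # (filename, lineno, func) -> ticks
    cum_samples = collections.defaultdict(int) # same key, including callers
    _stop = threading.Event()

    def _sampler():
        me = threading.current_thread().ident
        while not _stop.is_set():
            for thread_id, frame in list(sys._current_frames().items()):
                if thread_id == me:
                    continue                   # do not account the sampler itself
                key = (frame.f_code.co_filename, frame.f_lineno, frame.f_code.co_name)
                samples[key] += 1
                while frame is not None:       # stack walk for cumulative counts
                    key = (frame.f_code.co_filename, frame.f_lineno, frame.f_code.co_name)
                    cum_samples[key] += 1
                    frame = frame.f_back
            time.sleep(SAMPLE_INTERVAL)

    def start():
        thread = threading.Thread(target=_sampler, name="sampler")
        thread.daemon = True
        thread.start()
        return thread

    def stop():
        _stop.set()

    def report(limit=10):
        ordered = sorted(samples.items(), key=lambda item: item[1], reverse=True)
        print("   ticks  (cum)  filename:lineno function")
        for key, count in ordered[:limit]:
            filename, lineno, func = key
            print("%8d %6d  %s:%d %s" % (count, cum_samples[key], filename, lineno, func))

Calling start() before the workload and stop()/report() afterwards prints a per-line listing similar in spirit to the tables above, though without the cpu_time column.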

Hi Sümer,

On 01/05/2012 07:45 AM, Sümer Cip wrote:
> Have you looked at yappi? yappi.start() accepts a sampling parameter.

No, I haven't, because I googled for "sampling profiler" or "statistical profiler".

I compared running pybench with yappi vs. my sampling profiler (with 100 samples/sec). First, the run time increases in both cases vs. the original performance of Python. The figures are different because I backported to Python 2 again for the comparison (is yappi available for Python 3?):

    yappi:
    Totals:    4692ms  3317ms  +41.5%  4831ms  3448ms  +40.1%

    sampling_profiler:
    Totals:    3491ms  3269ms   +6.8%  3638ms  3435ms   +5.9%

These are the hot spots as reported by yappi:

    name                                 #n        tsub      ttot      tavg
    Tools/pybench/Calls.py.test:439      10        1.709853  1.709853  0.170985
    Tools/pybench/Strings.py.test:467    10        1.383400  2.133651  0.213365
    Tools/pybench/With.py.test:16        10        1.179130  1.614330  0.161433
    Tools/pybench/Calls.py.test:176      10        1.088372  1.885379  0.188538
    Tools/pybench/Unicode.py.test:388    10        1.062785  1.729250  0.172925
    Tools/pybench/Calls.py.test:118      10        1.008680  1.284282  0.128428
    Tools/pybench/With.py.test:64        10        0.999503  1.431072  0.143107
    Tools/pybench/Lists.py.test:9        10        0.995866  1.394482  0.139448
    Tools/pybench/Calls.py.test:9        10        0.979050  1.505686  0.150569
    Tools/pybench/Dict.py.test:351       10        0.959278  1.290247  0.129025

Compared with the sampling_profiler:

    cpu_time  (cum)  ticks  (cum)  samples  (cum)  filename:lineno function
       0.552  4.224     55    419       55    419  Tools/pybench/Calls.py:446 f
       0.512  0.512     52     52       52     52  Tools/pybench/Unicode.py:452 calibrate
       0.436  0.436     44     44       44     44  Tools/pybench/Unicode.py:542 calibrate
       0.436  0.436     44     44       44     44  Tools/pybench/Strings.py:562 calibrate
       0.372  0.372     37     37       37     37  Tools/pybench/Constructs.py:485 test
       0.348  0.348     36     36       36     36  Tools/pybench/Calls.py:122 f
       0.272  0.272     27     27       27     27  Tools/pybench/Lists.py:339 test
       0.248  0.248     25     25       25     25  Tools/pybench/Calls.py:445 f
       0.228  0.228     22     22       22     22  Tools/pybench/Strings.py:247 calibrate
       0.212  0.212     21     21       21     21  Tools/pybench/With.py:14 __exit__

Let's look at the leading hot spots:

1) yappi reports Calls.py line 439 as the hot spot. This refers to this code:

    def test(self):              # <-- line 439
        global f

        def f(x):
            if x > 1:
                return f(x-1)    # <-- line 446
            return 1

        for i in xrange(self.rounds):
            f(10)
            f(10)
            ...

Here yappi identifies the outer function, while the sampling profiler points directly to the line where most of the time is spent.

2) Most other hot spots reported by sampling_profiler lie in the calibrate functions. I wondered why that might be and noticed that pybench calls these 20 times by default.

As the sampling_profiler reports hot spots per line rather than per function as yappi does, we cannot really compare these results directly. So let's compare the behaviour of the different profilers on this code:

    from time import time

    def fib(n):
        if n > 1:
            return fib(n-1) + fib(n-2)
        else:
            return n

    def fib2(n):
        stack = [n]
        while stack:
            n = stack.pop()
            if n > 1:
                stack.append(n-1)
                stack.append(n-2)

    s = time()
    fib(34)
    print "fib: %s seconds" % (time() - s)
    s = time()
    fib2(34)
    print "fib2: %s seconds" % (time() - s)

These are two variants of the Fibonacci function. The first works by recursion and returns the correct result. The second just simulates the same behaviour without recursion, using a local stack. It does not compute the result, since that is not needed here - I just want to compare how this behaves with respect to our profilers.
When I run this without any profiler, I get:

    fib: 2.87187790871 seconds
    fib2: 5.26651716232 seconds

With yappi, I get this:

    fib: 6.42756795883 seconds
    fib2: 14.6631779671 seconds

    name                                 #n        tsub      ttot      tavg
    demo.py.<module>:1                   1         6.427689  21.090864 21.090864
    demo.py.fib2:9                       1         10.353305 14.663173 14.663173
    ..atprof/Lib/posixpath.py.dirname:11 1         0.000004  0.000005  0.000005
    ..prof/Lib/threading.py.setprofile:8 1         0.000001  0.000001  0.000001
    ..n2.6/site-packages/yappi.py.stop:4 1         0.000000  0.000000  0.000000
    demo.py.fib:3                        184549..  0.000000  0.000000  0.000000

The slowdown for fib is about what I expected, but I have no idea why fib2 is slowed down that much. AFAIK, the usual profiling hooks of Python are called on entry and exit of each function, not on each line?!

The results are incorrect for the fib function for reasons unknown to me - yappi thinks that no time is spent in that function, while the program output makes it obvious that more than 6 seconds are spent there.

I tried using the timing_sample parameter to yappi.start. When set to 2, all of tsub, ttot and tavg are 0 in the output for this example. I can speed up the run to 4.97 seconds for fib and 9.33 seconds for fib2 using a high timing_sample value of 200 or more. That is still quite some overhead for keeping the profiler active at all times.

With the sampling profiler, the output is:

    fib: 3.10812282562 seconds
    fib2: 5.91388201714 seconds

    Thread MainThread (902 samples, 9.036564 seconds)
    ------------------------------------------------------------------------
    cpu_time  (cum)  ticks  (cum)  samples  (cum)  filename:lineno function
       0.000  9.005      0    902        0    902  ython-work/statprof/Lib/runpy.py:122 _run_module_as_main
       0.000  9.005      0    902        0    902  tatprof/Lib/sampling_profiler.py:203 main
       0.000  9.005      0    902        0    902  tatprof/Lib/sampling_profiler.py:209 <module>
       0.000  9.005      0    902        0    902  ython-work/statprof/Lib/runpy.py:34 _run_code
       0.000  5.912      0    592        0    592  demo.py:21 <module>
       2.068  3.092    211    310      211    310  demo.py:5 fib
       0.000  3.092      0    310        0    310  demo.py:18 <module>
       2.196  2.196    217    217      217    217  demo.py:12 fib2
       1.576  1.576    157    157      157    157  demo.py:15 fib2
       0.980  0.980    101    101      101    101  demo.py:14 fib2
       0.900  0.900     87     87       87     87  demo.py:4 fib
       0.644  0.644     67     67       67     67  demo.py:11 fib2
       0.516  0.516     50     50       50     50  demo.py:13 fib2
       0.124  0.124     12     12       12     12  demo.py:7 fib

Just a small slowdown here. And as you can see, this even tells you on which lines the most CPU time is spent. Looking at this, I miss per-function totals, but that should be easy to add (see the sketch after this mail). As you can see, the profiler correctly detects that line 21 and line 18 of the demo code take most of the time (these are the module-level calls to fib2 and fib).

For completeness, let's look at what cProfile finds out:

    torsten@pulsar:~/python-work/statprof$ ./python run_cprofile.py demo.py
    fib: 5.83289813995 seconds
    fib2: 11.2586638927 seconds
             55364794 function calls (36909866 primitive calls) in 17.092 CPU seconds

       Ordered by: standard name

       ncalls      tottime  percall  cumtime  percall  filename:lineno(function)
            1        0.000    0.000   17.092   17.092  <string>:1(<module>)
            1        0.000    0.000   17.092   17.092  demo.py:1(<module>)
   18454929/1        5.833    0.000    5.833    5.833  demo.py:3(fib)
            1        7.831    7.831   11.259   11.259  demo.py:9(fib2)
     18454928        1.421    0.000    1.421    0.000  {method 'append' of 'list' objects}
            1        0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}
     18454929        2.007    0.000    2.007    0.000  {method 'pop' of 'list' objects}
            4        0.000    0.000    0.000    0.000  {time.time}

I am impressed that cProfile detects the recursive calls of the fib function (I guess that is the information in the ncalls column).
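[Editorial note: the run_cprofile.py wrapper used above is not included in the thread. A minimal script in the same spirit might look like the following sketch, built only on the standard cProfile and pstats modules; the actual wrapper may well differ.]

    import cProfile
    import pstats
    import sys

    # Hypothetical stand-in for run_cprofile.py: compile the target script,
    # run it under a cProfile.Profile instance and print the statistics in
    # "standard name" order, as in the listing above.
    script = sys.argv[1]
    with open(script) as f:
        code = compile(f.read(), script, "exec")

    glb = {"__name__": "__main__", "__file__": script}
    prof = cProfile.Profile()
    prof.runctx(code, glb, glb)
    pstats.Stats(prof).sort_stats("stdname").print_stats()

Invoked as "./python run_cprofile.py demo.py", this produces a listing of the same shape as the one shown above.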
Anyway, the slowdown with cProfile is much more noticeable for the same information. So, why would I prefer the sampling profiler?

a) Less slowdown (it could still use some optimization).
b) It points directly to the code lines where the hot spots are (based on equidistant time samples).
c) It does not distort the run time by being active.

Greetings, Torsten
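[Editorial note: the per-function totals that are mentioned above as missing from the sampling profiler's report could be aggregated from the per-line data. The sketch below assumes that data is available as a dict mapping (filename, lineno, function) to (cpu_time, ticks, samples) tuples; the actual data layout in profiler_util may differ.]

    # Roll per-line statistics up into per-function totals, assuming the
    # per-line data is a dict: (filename, lineno, function) -> (cpu_time, ticks, samples).
    def per_function_totals(line_stats):
        totals = {}
        for (filename, lineno, function), (cpu_time, ticks, samples) in line_stats.items():
            key = (filename, function)
            cpu, tck, smp = totals.get(key, (0.0, 0, 0))
            totals[key] = (cpu + cpu_time, tck + ticks, smp + samples)
        return totals

    def print_function_totals(line_stats, limit=10):
        totals = per_function_totals(line_stats)
        ordered = sorted(totals.items(), key=lambda item: item[1][0], reverse=True)
        print("cpu_time   ticks  samples  filename function")
        for (filename, function), (cpu_time, ticks, samples) in ordered[:limit]:
            print("%8.3f  %6d  %7d  %s %s" % (cpu_time, ticks, samples, filename, function))

The cumulative columns would need the same treatment, with some care not to double-count a function that appears several times in a single stack.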

participants (2)
- Sümer Cip
- Torsten Landschoff