[pypy-dev] performance benchmark suite

Victor Stinner victor.stinner at gmail.com
Wed Apr 5 16:32:20 EDT 2017


Hi,

I'm working on speed.python.org, the website which runs the CPython
benchmarks. I reworked the benchmark suite, which is now called
"performance":

   http://pyperformance.readthedocs.io/

performance contains 54 benchmarks and works on Python 2.7 and 3.x. It
creates a virtual environment with pinned versions of its requirements
to "isolate" the benchmark from the system and get more reproducible
results. I added a few benchmarks from the PyPy benchmark suite, but I
didn't add all of them yet.
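
Running the suite looks roughly like this (a sketch; see the
pyperformance documentation above for the exact command-line options):

  pyperformance list                 # list the available benchmarks
  pyperformance run -o cpython.json  # run the suite, write JSON results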

performance is now based on my perf module. The perf module is a
toolkit to run, analyze and compare benchmarks:

   http://perf.readthedocs.io/
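
As a minimal example, a benchmark script written with the perf API
looks something like this (a sketch based on the perf documentation;
bench_func() runs the function and collects the timings for you):

  import perf

  def func():
      # the code to benchmark: build and sort a small list
      sorted(list(range(1000)))

  runner = perf.Runner()
  runner.bench_func('sort', func)

The Runner takes care of spawning the calibration and worker
processes described below.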

I would like to know how to adapt perf and performance to correctly
handle the PyPy JIT compiler: I would like to measure the performance
once the code has been optimized by the JIT compiler, and ignore the
warmup phase. I already made a few changes in perf and performance
when a JIT is detected, but I'm not sure that I did them correctly.

My final goal would be to have PyPy benchmark results on
speed.python.org, to easily compare CPython and PyPy (using the same
benchmark runner, same physical server).

The perf module calibrates a benchmark based on time: it computes the
number of outer loops needed to get a timing of at least 100 ms.
Basically, a single value is computed as:

  import perf

  t0 = perf.perf_counter()
  for _ in range(loops):  # 'loops' comes from the calibration
      func()
  value = perf.perf_counter() - t0

perf spawns a process only to calibrate the benchmark. On PyPy, it now
(in the master branch) spawns a second process which only computes
warmup samples to validate the calibration. If a value takes less than
100 ms, the number of loops is doubled. The operation is repeated
until the number of loops stops changing.
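
The calibration logic is roughly the following (a simplified sketch,
not the actual perf implementation):

  import perf

  def calibrate(func, min_time=0.1):
      # double the number of loops until one value takes >= 100 ms
      loops = 1
      while True:
          t0 = perf.perf_counter()
          for _ in range(loops):
              func()
          if perf.perf_counter() - t0 >= min_time:
              return loops
          loops *= 2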

After the calibration, perf spawns worker processes sequentially: each
worker computes warmup samples and then computes values.

By default, each worker computes 1 warmup sample and 3 values on
CPython, and 10 warmup samples and 10 values on PyPy.
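
In pseudo-code, a worker does something like this (an illustrative
sketch, not the actual perf code):

  import perf

  def bench_once(func, loops):
      # one value: the total time of 'loops' calls to func()
      t0 = perf.perf_counter()
      for _ in range(loops):
          func()
      return perf.perf_counter() - t0

  def worker(func, loops, nwarmup, nvalue):
      warmups = [bench_once(func, loops) for _ in range(nwarmup)]
      values = [bench_once(func, loops) for _ in range(nvalue)]
      return warmups, values  # the warmups are ignored at the end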

The configuration for PyPy is kind of arbitrary, whereas it was finely
tuned for CPython.

At the end, perf ignores all warmup samples and only computes the mean
and standard deviation of the remaining values. For example, on
CPython 21 processes are spawned: 1 calibration process + 20 workers;
each worker computes 1 warmup sample + 3 values, so the mean is
computed over 20 x 3 = 60 values.
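
Computing the final result from the collected values is then simple (a
sketch using the standard statistics module; the numbers are made up
for the example):

  import statistics

  # flat list of 20 workers x 3 values = 60 timings in practice,
  # with all warmup samples already discarded
  values = [0.102, 0.105, 0.101]
  print('mean: %.3f sec' % statistics.mean(values))
  print('std dev: %.3f sec' % statistics.stdev(values))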

perf stores all data in a JSON file: metadata (hostname, CPU speed,
system load, etc.), the number of loops, warmup samples, values, etc.
It provides an API to access all data.
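
Loading a result file from Python looks something like this (a sketch
following the perf documentation; method names may differ slightly
between perf versions):

  import perf

  bench = perf.Benchmark.load('result.json')
  print('values:', bench.get_values())
  print('mean:', bench.mean())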

perf also contains a lot of tools to analyze data: statistics (min,
max, median/MAD, percentiles, ...), rendering a histogram, comparing
results and checking whether the difference is significant, detecting
unstable benchmarks, etc.
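
These tools are exposed as subcommands, for example (a sketch; see the
perf documentation for the full list):

  python3 -m perf stats result.json     # min/max/mean/percentiles
  python3 -m perf hist result.json      # render a histogram
  python3 -m perf compare_to old.json new.json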

perf also comes with documentation explaining how to run benchmarks,
analyze benchmarks, get stable/reproducible results, tune your system
to run benchmarks, etc.

To tune your system for benchmarks, run the "sudo python3 -m perf
system tune" command. It configures the CPU (disables Turbo Boost,
sets a fixed frequency, ...), checks that the power cable is plugged
in, sets the CPU affinity of IRQs, disables Linux perf events, etc.
The command reduces operating system jitter.

Victor

