Hi,
I started to write blog posts on stable benchmarks:
- https://haypo.github.io/journey-to-stable-benchmark-system.html
- https://haypo.github.io/journey-to-stable-benchmark-deadcode.html
- https://haypo.github.io/journey-to-stable-benchmark-average.html
One important point is that the minimum is commonly used in Python benchmarks, whereas it is a bad practice for getting stable benchmark results.
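Here is a toy simulation (not perf code) just to illustrate the difference between the two summaries: the minimum only reflects the single luckiest sample, whereas mean +- stdev describes the whole distribution:

    import random
    import statistics

    random.seed(1)

    def noisy_samples(n=100):
        # n timings of a 250 ns operation plus random "system noise"
        return [250e-9 + abs(random.gauss(0, 5e-9)) for _ in range(n)]

    for run in range(1, 4):
        samples = noisy_samples()
        print("run %d: min=%.1f ns  mean=%.1f ns +- %.1f ns"
              % (run, min(samples) * 1e9,
                 statistics.mean(samples) * 1e9,
                 statistics.stdev(samples) * 1e9))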
I started to work on a toolkit to write benchmarks, the new "perf" module: http://perf.readthedocs.io/en/latest/ https://github.com/haypo/perf
I used timeit as a concrete use case, since timeit is popular and badly implemented. timeit currently uses 1 process which runs the microbenchmark 3 times and takes the minimum. timeit is *known* to be unstable, and the common advice is to run it at least 3 times and again take the minimum of the minimums.
Some links about timeit being unstable:
- https://mail.python.org/pipermail/python-dev/2012-August/121379.html
- https://bugs.python.org/issue23693
- https://bugs.python.org/issue6422 (not directly related)
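For reference, here is what the common practice looks like with the stdlib timeit API (the statement is just an arbitrary example):

    import timeit

    # repeat() returns one total time per repetition; the common practice
    # keeps only the fastest repetition.
    timings = timeit.repeat("'-'.join(map(str, range(100)))",
                            number=10**5, repeat=3)
    best = min(timings)
    print("best of 3: %.3f usec per loop" % (best / 10**5 * 1e6))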
Moreover, the timeit module disables the garbage collector during timing, which is also wrong: it's rare to disable the GC in applications, so the numbers don't reflect how the code behaves in practice.
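If you really want the GC enabled while timing, the timeit documentation suggests re-enabling it from the setup code; a small example (the statement is arbitrary):

    import timeit

    # timeit.timeit() disables the GC while timing; re-enable it in setup
    # to measure the code as applications actually run it.
    with_gc = timeit.timeit("[dict() for _ in range(100)]",
                            setup="import gc; gc.enable()", number=10**4)
    without_gc = timeit.timeit("[dict() for _ in range(100)]", number=10**4)
    print("GC enabled: %.3f s, GC disabled: %.3f s" % (with_gc, without_gc))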
My goal for the perf module is to provide basic features and then reuse it in existing benchmarks:
- mean() and stdev() to display results (see the sketch after this list)
- a clock chosen for benchmarking
- result classes to store numbers
- etc.
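To give an idea of what I mean (the names and shapes below are only illustrative, not the final perf API), something in the spirit of:

    import statistics
    import time

    def bench(func, loops=10**5, samples=3):
        # Time `loops` calls of func(), `samples` times; return seconds per call.
        timings = []
        for _ in range(samples):
            t0 = time.perf_counter()
            for _ in range(loops):
                func()
            timings.append((time.perf_counter() - t0) / loops)
        return timings

    timings = bench(lambda: sorted([5, 3, 1, 4, 2]))
    print("%.1f ns +- %.1f ns"
          % (statistics.mean(timings) * 1e9, statistics.stdev(timings) * 1e9))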
Work in progress:
- new implementation of timeit using multiple processes (a toy sketch follows this list)
- perf.metadata module: collect various information about Python, the system, etc.
- file format to store numbers and metadata
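Here is a toy sketch of the multiple-processes idea (not the actual perf implementation):

    import statistics
    import subprocess
    import sys

    # Each child process runs the timing independently and prints one
    # number (seconds per loop); the parent aggregates the samples.
    WORKER = "import timeit; print(timeit.timeit('1+1', number=10**6) / 10**6)"

    def run_workers(processes=5):
        timings = []
        for _ in range(processes):
            out = subprocess.check_output([sys.executable, "-c", WORKER])
            timings.append(float(out.decode()))
        return timings

    timings = run_workers()
    print("%.1f ns +- %.1f ns"
          % (statistics.mean(timings) * 1e9, statistics.stdev(timings) * 1e9))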
I'm interested in the very basic perf.py internal text format: one timing per line, that's all. But it's incomplete: the "loops" information is not stored. Maybe a binary format would be better? I don't know yet.
It should be possible to combine files from multiple processes. I'm also interested in implementing a generic "rerun" command to add more samples if a benchmark doesn't look stable enough.
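To make the idea concrete, merging such files could look like this (the file names are hypothetical, and this ignores the missing "loops" information):

    import statistics

    def load_timings(filename):
        # One timing per line, in seconds; blank lines are ignored.
        with open(filename) as fp:
            return [float(line) for line in fp if line.strip()]

    samples = []
    for filename in ("run-1.txt", "run-2.txt", "run-3.txt"):
        samples.extend(load_timings(filename))

    print("%d samples: %.1f ns +- %.1f ns"
          % (len(samples), statistics.mean(samples) * 1e9,
             statistics.stdev(samples) * 1e9))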
perf.timeit looks more stable than timeit, and the CLI is basically the same: just replace "-m timeit" with "-m perf.timeit".
5 timeit outputs ("1000000 loops, best of 3: ... per loop"):
- 0.247 usec
- 0.252 usec
- 0.247 usec
- 0.251 usec
- 0.251 usec
It's disturbing to get 3 different "minimums" :-/
5 perf.timeit outputs ("Average: 25 runs x 3 samples x 10^6 loops: ..."):
- 250 ns +- 3 ns
- 250 ns +- 3 ns
- 251 ns +- 3 ns
- 251 ns +- 4 ns
- 251 ns +- 3 ns
Note: I also got "258 ns +- 17 ns" when I opened a webpage in Firefox while the benchmark was running.
Note: I ran these benchmarks on a regular Linux system without any specific tuning. ASLR was enabled, but the system was idle.
Victor