I started to write blog posts on stable benchmarks:
1) https://haypo.github.io/journey-to-stable-benchmark-system.html
2) https://haypo.github.io/journey-to-stable-benchmark-deadcode.html
3) https://haypo.github.io/journey-to-stable-benchmark-average.html
One important point is that the minimum is commonly used in Python benchmarks, whereas it is bad practice if you want stable benchmark results.
I used timeit as a concrete use case, since timeit is popular and badly implemented. timeit currently uses a single process which runs the microbenchmark 3 times and takes the minimum. timeit is known to be unstable, and the common advice is to run it at least 3 times and again take the minimum of the minimums.
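To make the pattern explicit, here is a rough sketch of what that practice looks like when written by hand (this is not timeit's internal code, and the statement being timed is only a placeholder):

    import timeit

    # One process, a few repetitions, keep only the minimum.
    timings = timeit.repeat("sorted(range(1000))", repeat=3, number=10000)
    print("best of 3: %.2f usec per loop" % (min(timings) / 10000 * 1e6))

    # The common advice is then to run this whole script several times
    # and keep the minimum of these minimums.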
Some examples of links about timeit being unstable:
Moreover, the timeit module disables the garbage collector, which is also wrong: it's rare to disable the GC in applications, so a benchmark run with the GC disabled doesn't reflect how the code behaves in practice.
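As a side note, the timeit documentation itself mentions a workaround: the GC can be re-enabled as the first statement of the setup string. A small sketch (the timed statement is again just a placeholder):

    import timeit

    # Re-enable the GC from the setup string so that the timed loop runs
    # with the garbage collector active, as in a regular application.
    timings = timeit.repeat("sorted(range(1000))",
                            setup="import gc; gc.enable()",
                            repeat=3, number=10000)
    print(min(timings))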
My goal for the perf module is to provide basic features and then reuse it in existing benchmarks:
Work in progress:
I'm interested in the very basic perf.py internal text format: one timing per line, that's all. But it's incomplete: the "loops" information is not stored. Maybe a binary format would be better? I don't know yet.
It should be possible to combine the files of multiple processes. I'm also interested in implementing a generic "rerun" command to add more samples if a benchmark doesn't look stable enough.
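To give a concrete picture, below is a minimal sketch (not perf's actual implementation) of such a text format and of the merge step: each worker process appends one timing per line to its own file, and a separate step loads all files and reports the average with its standard deviation. The file names and the assumption that timings are stored in nanoseconds are mine; note that this naive format indeed loses the "loops" information mentioned above.

    import glob
    import statistics

    def load_samples(pattern="timings-*.txt"):
        # One timing per line, one file per worker process.
        samples = []
        for filename in glob.glob(pattern):
            with open(filename) as fp:
                for line in fp:
                    line = line.strip()
                    if line:
                        samples.append(float(line))
        return samples

    samples = load_samples()
    if len(samples) >= 2:
        print("Average: %d samples: %.1f ns +- %.1f ns"
              % (len(samples), statistics.mean(samples), statistics.stdev(samples)))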
perf.timeit looks more stable than timeit, and the CLI is basically the same: just replace "-m timeit" with "-m perf.timeit".
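For example (the microbenchmark statement below is only a placeholder):

    python3 -m timeit "sorted(range(1000))"
    python3 -m perf.timeit "sorted(range(1000))"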
5 timeit outputs ("1000000 loops, best of 3: ... per loop"):
It's disturbing to get 3 different "minimums" :-/
5 perf.timeit outputs ("Average: 25 runs x 3 samples x 10^6 loops: ..."):
Note: I also got "258 ns +- 17 ns" when I opened a webpage in Firefox while the benchmark was running.
Note: I ran these benchmarks on a regular Linux system without any specific tuning. ASLR was enabled, but the system was idle.