Experiences with Microbenchmarking

Hi,
A colleague has just pointed me to the discussions on this list regarding benchmarking methodology. Over the past few months we have been devising an "as rigorous as possible" micro-benchmarking experiment, and it seems there's a lot of crossover between our work and your discussions.
In short, our experiment is investigating the warmup behaviours of JITted VMs (currently PyPy, HotSpot, Graal, LuaJIT, HHVM, JRubyTruffle and V8) using microbenchmarks. For each microbenchmark/VM pairing we sequentially run a number of processes (currently 10), and within each process we run 2000 iterations of the microbenchmark. We then plot the results and make observations.
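To make the setup concrete, here is a minimal sketch of what a single in-process execution looks like under the scheme described above (the benchmark function and names are hypothetical, and this is not Krun's actual code): each of the 2000 in-process iterations is timed individually with a monotonic clock, and the per-iteration wall-clock times are what the run-sequence plots are drawn from.

    import time

    ITERS = 2000  # in-process iterations per process execution

    def run_iterations(bench_fn, iters=ITERS):
        # Time each iteration individually with a monotonic clock and
        # return the list of per-iteration wall-clock times (seconds).
        times = []
        for _ in range(iters):
            start = time.monotonic()
            bench_fn()  # one iteration of the microbenchmark
            times.append(time.monotonic() - start)
        return times

For each microbenchmark/VM pairing this loop is repeated in 10 fresh processes, giving 10 x 2000 timings per pairing to plot.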
The experiments were run under our own "paranoid" benchmark runner (Krun), which aims to control as many confounding variables as is practically possible. Amongst other things, it checks that all benchmarks are run with the system at a similar starting temperature, disables ASLR, uses a monotonic system clock (in some cases we had to patch VMs), and reboots the system before each benchmark. We did not isolate CPUs, since we found that this creates artificial contention on multi-threaded VMs; however, we did use (and Krun checks for) a tickless Linux kernel.
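To give a flavour of the kind of environment pre-flight checks described above, here is an illustrative sketch for a Linux host (this is not Krun's actual code; the specific files and checks are assumptions, and the temperature and reboot checks are omitted):

    import platform

    def aslr_disabled():
        # On Linux, randomize_va_space == 0 means ASLR is switched off.
        with open("/proc/sys/kernel/randomize_va_space") as f:
            return f.read().strip() == "0"

    def tickless_kernel():
        # A fully tickless kernel is built with CONFIG_NO_HZ_FULL=y.
        with open("/boot/config-" + platform.release()) as f:
            return "CONFIG_NO_HZ_FULL=y" in f.read()

    assert aslr_disabled() and tickless_kernel(), "refusing to benchmark"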
We expected to see typical warmup behaviours (with distinct phases for profiling, compilation, and peak performance), but in reality we saw all kinds of crazy behaviours and even slowdowns.
We've published a draft paper showing our preliminary findings here: http://arxiv.org/abs/1602.00602
The draft shows a subset of our results. Run-sequence plots for all process executions can be found here: https://archive.org/download/softdev_warmup_experiment_artefacts/v0.1/all_gr...
For the final version of the paper we are trying to devise statistical methods to automatically classify the strange warmup behaviours we encountered. We will also run CPython in our final experiment, which may interest you guys :)
If this interests anyone, I'd be happy to discuss further.
Cheers
--
Best Regards
Edd Barrett

Hi Edd,
On Fri, Feb 12, 2016 at 12:18 PM, Edd Barrett <edd@theunixzoo.co.uk> wrote:
> JITted VMs (currently PyPy, HotSpot, Graal, LuaJIT, HHVM, JRubyTruffle and V8) using microbenchmarks. For each microbenchmark/VM pairing we sequentially run a number of processes (currently 10), and within each process we run 2000 iterations of the microbenchmark. We then plot the results and make observations.
PyPy typically needs more than 2000 iterations to be warmed up.
A bientôt,
Armin.