2016-04-27 20:30 GMT+02:00 Brett Cannon email@example.com:
My first intuition is some cache somewhere is unhappy w/ the varying sizes. Have you tried any of this on another machine to see if the results are consistent?
On my laptop, the performance when I add deadcode doesn't seem to change much: the delta is smaller than 1%.
I found a fix for my deadcode issue! Use "make profile-opt" rather than "make". With PGO (profile-guided optimization), GCC reorders hot functions to place them closer together. I also read that it records branch statistics so that it can emit the most frequent branch first.
I also modified bm_call_simple.py to use multiple processes and to use random hash seeds, rather than using a single process and disabling hash randomization.
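A minimal sketch of that approach, not the actual bm_call_simple.py change: each worker process is started with PYTHONHASHSEED=random, so every process gets a different hash seed, and the parent collects the timings from all of them. The names CHILD_CODE, run_processes and the trivial func() workload are hypothetical and only here for illustration.

```python
import os
import statistics
import subprocess
import sys

# Hypothetical workload run in each child process: time `loops` batches of
# calls to an empty function and print one timing (in seconds) per line.
CHILD_CODE = """
import sys, time

def func():
    pass

loops = int(sys.argv[1])
for _ in range(loops):
    start = time.perf_counter()
    for _ in range(100000):
        func()
    print(time.perf_counter() - start)
"""


def run_processes(processes=15, loops=5):
    """Run the workload in several fresh processes and return all timings."""
    timings = []
    for _ in range(processes):
        env = dict(os.environ)
        # Ask for a random hash seed explicitly, even if the parent process
        # was started with a fixed PYTHONHASHSEED.
        env["PYTHONHASHSEED"] = "random"
        out = subprocess.check_output(
            [sys.executable, "-c", CHILD_CODE, str(loops)],
            env=env,
            universal_newlines=True,
        )
        timings.extend(float(line) for line in out.split())
    return timings


if __name__ == "__main__":
    timings = run_processes()
    print("Average: %.1f ms +/- %.1f ms (min: %.1f ms, max: %.1f ms)"
          % (statistics.mean(timings) * 1e3,
             statistics.stdev(timings) * 1e3,
             min(timings) * 1e3,
             max(timings) * 1e3))
```

Spawning a new interpreter per process is what makes the hash seed (and other per-process state such as memory layout) vary between samples. A single long-running process would only ever see one seed.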
Comparison reference => fastcall (my whole fork, not just the tiny patches adding deadcode) using make (gcc -O3):
Average: 1183.5 ms +/- 6.1 ms (min: 1173.3 ms, max: 1201.9 ms) - 15 processes x 5 loops
=> Average: 1121.2 ms +/- 7.4 ms (min: 1106.5 ms, max: 1142.0 ms) - 15 processes x 5 loops
Comparison reference => fastcall using make profile-opt (PGO):
Average: 962.7 ms +/- 17.8 ms (min: 952.6 ms, max: 998.6 ms) - 15 processes x 5 loops
=> Average: 961.1 ms +/- 18.6 ms (min: 949.0 ms, max: 1011.3 ms) - 15 processes x 5 loops
Using make, fastcall *seems* to be faster, but in fact it looks more like the random noise caused by deadcode. Using PGO, fastcall doesn't change performance at all. I expected fastcall to be faster, but that's the purpose of benchmarks: to measure real performance, not expectations :-)
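To put numbers on that comparison, here is a small sketch that computes the relative change between reference and fastcall for both builds, using the averages and standard deviations reported above. The helper relative_change and the results dictionary are only illustrative. Note that the reported standard deviation only measures run-to-run variation for one binary. It does not capture the code placement effect, which changes from one build to another.

```python
def relative_change(ref_ms, new_ms):
    """Relative change from ref_ms to new_ms (negative means faster)."""
    return (new_ms - ref_ms) / ref_ms


# (reference average, reference stdev, fastcall average, fastcall stdev),
# all in milliseconds, copied from the results above.
results = {
    "make (gcc -O3)": (1183.5, 6.1, 1121.2, 7.4),
    "make profile-opt (PGO)": (962.7, 17.8, 961.1, 18.6),
}

for build, (ref, ref_std, new, new_std) in results.items():
    change = relative_change(ref, new)
    # Compare the absolute difference with the larger of the two stdevs.
    noise = max(ref_std, new_std)
    print("%s: %+.1f%% (delta %.1f ms, stdev up to %.1f ms)"
          % (build, change * 100, new - ref, noise))
```

With plain make the delta is about -5.3%, well above the run-to-run deviation, which is why fastcall seems faster there. With PGO the delta is about -0.2%, far below the deviation.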
Next step: modify most benchmarks of perf.py to run multiple processes rather than a single process, so that they are tested with multiple hash seeds.