[Speed] When CPython performance depends on dead code...

Victor Stinner victor.stinner at gmail.com
Thu Apr 28 04:27:11 EDT 2016


Hi,

2016-04-27 20:30 GMT+02:00 Brett Cannon <brett at python.org>:
> My first intuition is some cache somewhere is unhappy w/ the varying sizes.
> Have you tried any of this on another machine to see if the results are
> consistent?

On my laptop, the performance doesn't seem to change much when I add
dead code: the delta is smaller than 1%.

I found a fix for my dead code issue: use "make profile-opt" rather
than "make". With PGO, GCC reorders hot functions to place them closer
together. I also read that it records statistics on branches so that
it can emit the most frequent branch first.
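
Roughly, "make profile-opt" automates a two-phase PGO workflow: build
an instrumented binary, run a training workload to record branch and
call statistics, then rebuild using that profile. Here is a minimal
sketch of that flow, driving gcc from Python on a toy C program (not
CPython's actual build; it assumes gcc is available on PATH):

    import os, subprocess, tempfile

    # Toy C program: the branch is taken ~6 times out of 7, so the
    # training run gives GCC real branch statistics to record.
    C_SOURCE = """
    #include <stdio.h>
    int main(void) {
        long total = 0;
        for (long i = 0; i < 50000000; i++) {
            if (i % 7) total += i; else total -= i;
        }
        printf("%ld\\n", total);
        return 0;
    }
    """

    workdir = tempfile.mkdtemp()
    with open(os.path.join(workdir, "train.c"), "w") as f:
        f.write(C_SOURCE)

    def run(*cmd):
        subprocess.run(cmd, check=True, cwd=workdir)

    # Phase 1: instrumented build, then a training run which writes the
    # .gcda profile data next to the object file.
    run("gcc", "-O3", "-fprofile-generate", "-c", "train.c", "-o", "train.o")
    run("gcc", "-fprofile-generate", "train.o", "-o", "train")
    run("./train")

    # Phase 2: recompile using the recorded statistics, so GCC can move
    # hot code together and lay out the most frequent branch first.
    run("gcc", "-O3", "-fprofile-use", "-c", "train.c", "-o", "train.o")
    run("gcc", "train.o", "-o", "train")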

I also modified bm_call_simple.py to use multiple processes with
random hash seeds, rather than a single process with hash
randomization disabled.
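
Roughly, the idea looks like this (a simplified sketch, not my actual
patch; the inline timeit workload below just stands in for
bm_call_simple.py):

    import os, random, subprocess, sys

    # Each child process times a function-call workload itself and
    # prints the result in seconds, so interpreter startup is excluded.
    CHILD_CODE = ("import timeit; "
                  "print(timeit.timeit('f()', 'def f(): pass', number=10**6))")

    def run_child(seed):
        # Give every child its own random hash seed instead of disabling
        # hash randomization with PYTHONHASHSEED=0.
        env = dict(os.environ, PYTHONHASHSEED=str(seed))
        out = subprocess.check_output([sys.executable, "-c", CHILD_CODE],
                                      env=env)
        return float(out.decode()) * 1000.0  # seconds -> milliseconds

    timings = [run_child(random.randint(1, 2**32 - 1)) for _ in range(15)]
    print(timings)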

Comparison reference => fastcall (my whole fork, not just the tiny
patches adding dead code) using make (gcc -O3):

    Average: 1183.5 ms +/- 6.1 ms (min: 1173.3 ms, max: 1201.9 ms)
             - 15 processes x 5 loops
 => Average: 1121.2 ms +/- 7.4 ms (min: 1106.5 ms, max: 1142.0 ms)
             - 15 processes x 5 loops

Comparison reference => fastcall using make profile-opt (PGO):

    Average: 962.7 ms +/- 17.8 ms (min: 952.6 ms, max: 998.6 ms)
             - 15 processes x 5 loops
 => Average: 961.1 ms +/- 18.6 ms (min: 949.0 ms, max: 1011.3 ms)
             - 15 processes x 5 loops
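
(The "Average ... +/- ..." lines are just basic statistics over the
per-process timings, something like the sketch below: it assumes one
measurement per process in milliseconds and treats the +/- value as
the standard deviation.)

    import statistics

    def summarize(timings, processes=15, loops=5):
        # timings: one measurement per process, in milliseconds.
        return ("Average: %.1f ms +/- %.1f ms (min: %.1f ms, max: %.1f ms)"
                " - %d processes x %d loops"
                % (statistics.mean(timings), statistics.stdev(timings),
                   min(timings), max(timings), processes, loops))

    # e.g. print(summarize(timings)) with the timings collected by the
    # multi-process runner sketched earlier.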

Using make, fastcall *seems* to be faster, but in fact that looks more
like the random noise caused by dead code placement. Using PGO,
fastcall doesn't change performance at all. I expected fastcall to be
faster, but that's the purpose of benchmarks: to measure real
performance, not expectations :-)

Next step: modify most of the benchmarks in perf.py to run multiple
processes rather than a single process, so that they are tested with
multiple hash seeds.

Victor

