Hello,
Yesterday Neil Schemenauer mentioned some work that a colleague of mine (CCed) and I have done to improve CPython start-up time. Given the recent discussion, it seems timely to discuss what we are doing and whether it is of interest to other people hacking on the CPython runtime.
There are many ways to reduce the start-up time overhead. For this experiment, we are specifically targeting the cost of unmarshaling heap objects from compiled Python bytecode. Our measurements show this specific cost to represent 10% to 25% of the start-up time among the applications we have examined.
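For anyone who wants a rough feel for this cost on their own machine, here is a minimal sketch. It is not the instrumentation we used for the 10% to 25% figure; it simply times marshal.loads on the code object of one standard library module. The helper name time_unmarshal and the repeat count are made up for illustration.

```python
import importlib.util
import marshal
import time

# Fetch the top-level code object for a module without importing it.
# difflib is used only because it appears in the benchmarks in this mail.
spec = importlib.util.find_spec("difflib")
code = spec.loader.get_code("difflib")

# Serialize it the way the body of a .pyc file stores it.
payload = marshal.dumps(code)

def time_unmarshal(data, repeats=100):
    """Return the average wall-clock seconds taken by one marshal.loads call."""
    start = time.perf_counter()
    for _ in range(repeats):
        marshal.loads(data)
    return (time.perf_counter() - start) / repeats

per_load = time_unmarshal(payload)
print(f"{len(payload)} bytes, {per_load * 1e6:.1f} us per marshal.loads")
```

This only covers a single module in a warm process, so it will not match whole-process start-up measurements.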
Our approach to eliminating this overhead is to store unmarshaled objects in the data segment of the Python executable. We do this by processing the compiled Python bytecode for a module, creating native object code that contains the unmarshaled objects in their in-memory representation, and linking this into the Python executable.
When a module is imported, we simply return a pointer to the top-level code object in the data segment directly without invoking the unmarshaling code or touching the file system. What we are doing is conceptually similar to the existing capability to freeze a module, but we avoid non-trivial unmarshaling costs.
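The difference from freezing can be illustrated with a small analogy in Python. This is only a sketch: the real work happens in C and at link time, and Python cannot express placing objects in the data segment. The tables FROZEN_BYTES and PREBUILT and the helper functions are hypothetical names, not part of the patch.

```python
import marshal

SOURCE = "def greet(name):\n    return 'hello ' + name\n"
code = compile(SOURCE, "<demo>", "exec")

# Frozen-module style: the executable carries marshaled bytes,
# which still have to be decoded into objects at import time.
FROZEN_BYTES = {"demo": marshal.dumps(code)}

# By analogy with the approach described above: the object already
# exists in its in-memory form, so import just hands back a reference.
PREBUILT = {"demo": code}

def load_frozen(name):
    """Decode the stored bytes into a fresh code object on every load."""
    return marshal.loads(FROZEN_BYTES[name])

def load_prebuilt(name):
    """Return the existing code object; no decoding step is involved."""
    return PREBUILT[name]

def run(code_obj):
    """Execute a module-level code object and return its namespace."""
    namespace = {}
    exec(code_obj, namespace)
    return namespace

frozen_result = run(load_frozen("demo"))["greet"]("world")
prebuilt_result = run(load_prebuilt("demo"))["greet"]("world")
print(frozen_result, prebuilt_result)
```

Both paths yield equivalent modules; the second one just skips the decoding work.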
The patch is still under development and there is a little more work to do. With that in mind, the numbers look good, but please take them with a grain of salt:
Baseline
$ bench "./python.exe -c ''"
benchmarking ./python.exe -c ''
time 31.46 ms (31.24 ms .. 31.78 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 32.08 ms (31.82 ms .. 32.63 ms)
std dev 778.1 μs (365.6 μs .. 1.389 ms)
$ bench "./python.exe -c 'import difflib'"
benchmarking ./python.exe -c 'import difflib'
time 32.82 ms (32.64 ms .. 33.02 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 33.17 ms (33.01 ms .. 33.44 ms)
std dev 430.7 μs (233.8 μs .. 675.4 μs)
With our patch
$ bench "./python.exe -c ''"
benchmarking ./python.exe -c ''
time 24.86 ms (24.62 ms .. 25.08 ms)
0.999 R² (0.999 R² .. 1.000 R²)
mean 25.58 ms (25.36 ms .. 25.94 ms)
std dev 592.8 μs (376.2 μs .. 907.8 μs)
$ bench "./python.exe -c 'import difflib'"
benchmarking ./python.exe -c 'import difflib'
time 25.30 ms (25.00 ms .. 25.55 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 26.78 ms (26.30 ms .. 27.64 ms)
std dev 1.413 ms (747.5 μs .. 2.250 ms)
variance introduced by outliers: 20% (moderately inflated)
Here are some numbers with the patch but with the stat calls preserved, to isolate just the marshaling effects:
Baseline
$ bench "./python.exe -c 'import difflib'"
benchmarking ./python.exe -c 'import difflib'
time 34.67 ms (33.17 ms .. 36.52 ms)
0.995 R² (0.990 R² .. 1.000 R²)
mean 35.36 ms (34.81 ms .. 36.25 ms)
std dev 1.450 ms (1.045 ms .. 2.133 ms)
variance introduced by outliers: 12% (moderately inflated)
With our patch (and calls to stat)
$ bench "./python.exe -c 'import difflib'"
benchmarking ./python.exe -c 'import difflib'
time 30.24 ms (29.02 ms .. 32.66 ms)
0.988 R² (0.968 R² .. 1.000 R²)
mean 31.86 ms (31.13 ms .. 32.75 ms)
std dev 1.789 ms (1.329 ms .. 2.437 ms)
variance introduced by outliers: 17% (moderately inflated)
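To put the runs above side by side, here is a small sketch that computes the relative reduction from the reported "time" point estimates. It ignores the confidence intervals and the outlier warnings, so treat the percentages as rough. The function name reduction and the labels are just for illustration.

```python
def reduction(baseline_ms, patched_ms):
    """Return the fractional reduction in time relative to the baseline."""
    return (baseline_ms - patched_ms) / baseline_ms

# (label, baseline ms, patched ms), copied from the "time" lines above.
runs = [
    ("-c ''", 31.46, 24.86),
    ("import difflib", 32.82, 25.30),
    ("import difflib, stat calls preserved", 34.67, 30.24),
]

for label, baseline, patched in runs:
    print(f"{label}: {reduction(baseline, patched) * 100:.1f}% faster")
```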
(This work was done in CPython 3.6 and we are exploring back-porting to 2.7 so we can run the hg startup benchmarks in the performance test suite.)
This is effectively a drop-in replacement for the frozen module capability and (so far) has required only minimal changes to the runtime. To us, it seems like a very nice win without compromising on compatibility or adding complexity. Until we have a public patch available, I am happy to discuss the technical details in more depth.
I hope this provides some optimism around the possibility of improving the start-up time of CPython. What do you all think?
Kindly,
Carl