[Python-Dev] Python startup time

Tue Oct 9 17:02:02 EDT 2018

On 5/1/2018 8:26 PM, Gregory Szorc wrote:
> On 7/19/2017 12:15 PM, Larry Hastings wrote:
>>
>>
>> On 07/19/2017 05:59 AM, Victor Stinner wrote:
>>> Mercurial startup time is already 45.8x slower than Git whereas tested
>>> Mercurial runs on Python 2.7.12. Now try to sell Python 3 to Mercurial
>>> developers, with a startup time 2x - 3x slower...
>>
>> When Matt Mackall spoke at the Python Language Summit some years back, I
>> recall that he specifically complained about Python startup time.  He
>> said Python 3 "didn't solve any problems for [them]"--they'd already
>> solved their Unicode hygiene problems--and that Python's slow startup
>> time was already a big problem for them.  Python 3 being /even slower/
>> to start was absolutely one of the reasons why they didn't want to upgrade.
>>
>> You might think "what's a few milliseconds matter".  But if you run
>> hundreds of commands in a shell script it adds up.  git's speed is one
>> of the few bright spots in its UX, and hg's comparative slowness here is
>> a palpable disadvantage.
>>
>>
>>> So please continue efforts for make Python startup even faster to beat
>>> all other programming languages, and finally convince Mercurial to
>>> upgrade ;-)
>>
>> I believe Mercurial is, finally, slowly porting to Python 3.
>>
>>     https://www.mercurial-scm.org/wiki/Python3
>>
>> Nevertheless, I can't really be annoyed or upset at them moving slowly
>> to adopt Python 3, as Matt's objections were entirely legitimate.
> 
> I just now found found this thread when searching the archive for
> threads about startup time. And I was searching for threads about
> startup time because Mercurial's startup time has been getting slower
> over the past few months and this is causing substantial pain.
> 
> As I posted back in 2014 [1], CPython's startup overhead was >10% of the
> total CPU time in Mercurial's test suite. And when you factor in the
> time to import modules that get Mercurial to a point where it can run
> commands, it was more like 30%!
> 
> Mercurial's full test suite currently runs `hg` ~25,000 times. Using
> Victor's startup time numbers of 6.4ms for 2.7 and 14.5ms for
> 3.7/master, Python startup overhead contributes ~160s on 2.7 and ~360s
> on 3.7/master. Even if you divide this by the number of available CPU
> cores, we're talking dozens of seconds of wall time just waiting for
> CPython to get to a place where Mercurial's first bytecode can execute.
> 
> And the problem is worse when you factor in the time it takes to import
> Mercurial's own modules.
> 
> As a concrete example, I recently landed a Mercurial patch [2] that
> stubs out zope.interface to prevent the import of 9 modules on every
> `hg` invocation. This "only" saved ~6.94ms for a typical `hg`
> invocation. But this decreased the CPU time required to run the test
> suite on my i7-6700K from ~4450s to ~3980s (~89.5% of original) - a
> reduction of almost 8 minutes of CPU time (and over 1 minute of wall time)!
> 
> By the time CPython gets Mercurial to a point where we can run useful
> code, we've already blown most of or past the time budget where humans
> perceive an action/command as instantaneous. If you ignore startup
> overhead, Mercurial's performance compares quite well to Git's for many
> operations. But the reality is that CPython startup overhead makes it
> look like Mercurial is non-instantaneous before Mercurial even has the
> opportunity to execute meaningful code!
> 
> Mercurial provides a `chg` program that essentially spins up a daemon
> `hg` process running a "command server" so the `chg` program [written in
> C - no startup overhead] can dispatch commands to an already-running
> Python/`hg` process and avoid paying the startup overhead cost. When you
> run Mercurial's test suite using `chg`, it completes *minutes* faster.
> `chg` exists mainly as a workaround for slow startup overhead.
> 
> Changing gears, my day job is maintaining Firefox's build system. We use
> Python heavily in the build system. And again, Python startup overhead
> is problematic. I don't have numbers offhand, but we invoke likely a few
> hundred Python processes as part of building Firefox. It should be
> several thousand. But, we've had to "hack" parts of the build system to
> "batch" certain build actions in single process invocations in order to
> avoid Python startup overhead. This undermines the ability of some build
> tools to formulate a reasonable understanding of the DAG and it causes a
> bit of pain for build system developers and makes it difficult to
> achieve "no-op" and fast incremental builds because we're always
> invoking certain Python processes because we've had to move DAG
> awareness out of the build backend and into Python. At some point, we'll
> likely replace Python code with Rust so the build system is more "pure"
> and easier to maintain and reason about.
> 
> I've seen posts in this thread and elsewhere in the CPython development
> universe that challenge whether milliseconds in startup time matter.
> Speaking as a Mercurial and Firefox build system developer,
> *milliseconds absolutely matter*. Going further, *fractions of
> milliseconds matter*. For Mercurial's test suite with its ~25,000 Python
> process invocations, 1ms translates to ~25s of CPU time. With 2.7,
> Mercurial can dispatch commands in ~50ms. When you load common
> extensions, it isn't uncommon to see process startup overhead of
> 100-150ms! A millisecond here. A millisecond there. Before you know it,
> we're talking *minutes* of CPU (and potentially wall) time in order to
> run Mercurial's test suite (or build Firefox, or ...).
> 
> From my perspective, Python process startup and module import overhead
> is a severe problem for Python. I don't say this lightly, but in my mind
> the problem causes me to question the viability of Python for popular
> use cases, such as CLI applications. When choosing a programming
> language, I want one that will scale as a project grows. Vanilla process
> overhead has Python starting off significantly slower than compiled code
> (or even Perl) and adding module import overhead into the mix makes
> Python slower and slower as projects grow. As someone who has to deal
> with this slowness on a daily basis, I can tell you that it is extremely
> frustrating and it does matter. I hope that the importance of the
> problem will be acknowledged (milliseconds *do* matter) and that
> creative minds will band together to address it. Since I am
> disproportionately impacted by this issue, if there's anything I can do
> to help, let me know.

We were debugging abysmally slow execution of Mercurial's test harness
on macOS and we discovered a new wrinkle to the startup time problem.

It appears that APFS acquires some shared locks/mutexes in the kernel
when executing readdir() and other filesystem system calls. When you
have several Python processes all starting at the same time, I/O
attached to module importing (import.c:case_ok() by the looks of it for
Python 2.7) becomes a stress test of sorts for this lock acquisition. On
my 6+6 core MacBook Pro, ~75% of overall system CPU is spent in the
kernel when executing the test harness with 12 parallel tests.

If we run the test harness with the persistent `chg` command server
(which eliminates Python process startup overhead), wall execution time
drops from ~37:43s to ~9:06s.

This problem of shared locks on read-only operations appears to be
similar to that of AUFS, which I've blogged about [1].

It is pretty common for non-compiled languages (like Python, Ruby, PHP,
Perl, etc) to stat() the world as part of looking for modules to load.
Typically, the filesystem's stat cache will save you and the overhead
from hundreds or thousands of lookups is trivial (after first load). But
it appears APFS is quite sensitive to it. Any work to reduce the number
of filesystem API calls during Python startup will likely have a
profound impact on APFS when multiple Python processes are starting. A
"frozen" application where modules are in a shared container file is
probably ideal.

Python 3.7 doesn't exhibit as much of a problem. But it is still there.
A brief audit of the importer code and call stacks confirms it is the
same problem - just less prevalent. Wall time execution of the test
harness from Python 2.7 to Python 3.7 drops from ~37:43s to ~20:39.
Overall kernel CPU time drops from ~75% to ~19%. And that wall time
improvement is despite Python 3's slower process startup. So locking in
the kernel is really a killer on Python 2.7.

While we're here, CPython might want to look into getdirentriesattr() as
a replacement for readdir(). We switched to it in Mercurial several
years ago to make `hg status` operations significantly faster [2]. I'm
not sure if it will yield a speedup on APFS though. But it's worth a
try. (If it does, you could probably make
os.listdir()/os.scandir()/os.walk() significantly faster on macOS.)

I hope someone finds this information useful to further improving
[startup] performance. (And given that Python 3.7 is substantially
faster by avoiding excessive readdir(), I wouldn't be surprised if this
problem is already known!)

[1] https://gregoryszorc.com/blog/2017/12/08/good-riddance-to-aufs/
[2] https://www.mercurial-scm.org/repo/hg/rev/05ccfe6763f1