Hi,

On Twitter, Raymond Hettinger wrote:

"The decision making process on Python-dev is an anti-pattern, governed by anecdotal data and ambiguity over what problem is solved."

https://twitter.com/raymondh/status/887069454693158912

About "anecdotal data", I would like to discuss the Python startup time.

== Python 3.7 compared to 2.7 ==

First of all, on speed.python.org, we have:

* Python 2.7: 6.4 ms with site, 3.0 ms without site (-S)
* master (3.7): 14.5 ms with site, 8.4 ms without site (-S)

Python 3.7 startup is 2.3x slower with site (the default mode), or 2.8x slower without site (the -S command line option). (I will skip Python 3.4, 3.5 and 3.6, which are much worse than Python 3.7...)

So if a user complained about Python 2.7 startup time: be prepared for a 2x-3x angrier user when "forced" to upgrade to Python 3!

== Mercurial vs Git, Python vs C, startup time ==

Startup time matters a lot for Mercurial, since Mercurial is compared to Git. Git and Mercurial have similar features, but Git is written in C whereas Mercurial is written in Python. Quick benchmark on the speed.python.org server:

* hg version: 44.6 ms +- 0.2 ms
* git --version: 974 us +- 7 us

Mercurial's startup time is already 45.8x slower than Git's, and the tested Mercurial runs on Python 2.7.12. Now try to sell Python 3 to Mercurial developers, with a startup time 2x-3x slower...

I tested Mercurial 3.7.3 and Git 2.7.4 on Ubuntu 16.04.1 using "python3 -m perf command -- ...".

== CPython core developers don't care? no, they do care ==

Christian Heimes, Naoki INADA, Serhiy Storchaka, Yury Selivanov, me (Victor Stinner) and other core developers have made multiple changes over the last few years to reduce the number of imports at startup, optimize importlib, etc. IMHO all these core developers are well aware of the competition between programming languages, and honestly, Python startup time isn't "good". So let's compare it to other programming languages similar to Python.

== PHP, Ruby, Perl ==

I measured the startup time of other programming languages which are similar to Python, still on the speed.python.org server, using "python3 -m perf command -- ...":

* perl -e ' ': 1.18 ms +- 0.01 ms
* php -r ' ': 8.57 ms +- 0.05 ms
* ruby -e ' ': 32.8 ms +- 0.1 ms

Wow, Perl is quite good! PHP seems as good as Python 2 (but Python 3 is worse). Ruby startup time seems less optimized than the other languages.

Tested versions:

* perl 5, version 22, subversion 1 (v5.22.1)
* PHP 7.0.18-0ubuntu0.16.04.1 (cli) ( NTS )
* ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu]

== Quick Google search ==

I also searched for "python startup time" and "python slow startup time" on Google and found many articles. Some examples:

"Reducing the Python startup time"
http://www.draketo.de/book/export/html/498
=> "The python startup time always nagged me (17-30ms) and I just searched again for a way to reduce it, when I found this: The Python-Launcher caches GTK imports and forks new processes to reduce the startup time of python GUI programs."

https://nelsonslog.wordpress.com/2013/04/08/python-startup-time/
=> "Wow, Python startup time is worse than I thought."

"How to speed up python starting up and/or reduce file search while loading libraries?"
https://stackoverflow.com/questions/15474160/how-to-speed-up-python-starting...
=> "The first time I log to the system and start one command it takes 6 seconds just to show a few line of help. If I immediately issue the same command again it takes 0.1s. After a couple of minutes it gets back to 6s. (proof of short-lived cache)"

"How does one optimise the startup of a Python script/program?"
https://www.quora.com/How-does-one-optimise-the-startup-of-a-Python-script-p...
=> "I wrote a Python program that would be used very often (imagine 'cd' or 'ls') for very short runtimes, how would I make it start up as fast as possible?"

"Python Interpreter Startup time"
https://bytes.com/topic/python/answers/34469-pyhton-interpreter-startup-time

"Python is very slow to start on Windows 7"
https://stackoverflow.com/questions/29997274/python-is-very-slow-to-start-on...
=> "Python takes 17 times longer to load on my Windows 7 machine than Ubuntu 14.04 running on a VM"
=> "returns in 0.614s on Windows and 0.036s on Linux"

"How to make a fast command line tool in Python" (old article, Python 2.5.2)
https://files.bemusement.org/talks/OSDC2008-FastPython/
=> "(...) some techniques Bazaar uses to start quickly, such as lazy imports."

--

So please continue efforts to make Python startup even faster, to beat all other programming languages, and finally convince Mercurial to upgrade ;-)

Victor
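For reference, the startup numbers above come from perf's "command" runner; a minimal sketch of the invocations (assuming the perf module from PyPI is installed; results will of course differ per machine):

$ python3 -m perf command -- python3 -c pass       # default mode, with site
$ python3 -m perf command -- python3 -S -c pass    # without site
$ python3 -m perf command -- hg version
$ python3 -m perf command -- git --version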
On Wed, Jul 19, 2017 at 02:59:52PM +0200, Victor Stinner <victor.stinner@gmail.com> wrote:
"Python is very slow to start on Windows 7" https://stackoverflow.com/questions/29997274/python-is-very-slow-to-start-on...
However hard you are going to optimize Python, you cannot fix those "defenders", "guards" and "protectors". :-) This particular link can be excluded from consideration.

Oleg.
--
Oleg Broytman http://phdru.name/ phd@phdru.name
Programmers don't die, they just GOSUB without RETURN.
2017-07-19 15:22 GMT+02:00 Oleg Broytman <phd@phdru.name>:
On Wed, Jul 19, 2017 at 02:59:52PM +0200, Victor Stinner <victor.stinner@gmail.com> wrote:
"Python is very slow to start on Windows 7" https://stackoverflow.com/questions/29997274/python-is-very-slow-to-start-on...
However hard you are going to optimize Python you cannot fix those "defenders", "guards" and "protectors". :-) This particular link can be excluded from consideration.
Sorry, I didn't read carefully each link I posted. Even for me, knowing what Python does at startup, it's hard to explain why 3 people get such different timings: 15 ms, 75 ms and 300 ms, for example. In my experience, the following things impact Python startup:

* -S option: loading or not the site module
* Paths in sys.path: the PYTHONPATH environment variable, for example
* .pth files in sys.path
* Python running in a virtual environment or not
* Operating system: Python loads different modules at startup depending on the OS. Naoki INADA just removed _osx_support from being imported in the site module on macOS, for example.

My list is likely incomplete.

In the performance benchmark suite, a controlled virtual environment is created to have a known set of modules. FYI running Python in a virtual environment is slower than a "system" Python which runs outside a virtual environment...

Victor
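A quick way to see the -S and site effects is to count the modules already imported at startup, and to trace the individual imports (a rough sketch; exact counts vary by OS and Python version):

$ python3 -S -c "import sys; print(len(sys.modules))"
$ python3 -c "import sys; print(len(sys.modules))"
$ python3 -v -c pass   # traces each import at startup on stderr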
2017-07-19 16:26 GMT+02:00 Victor Stinner <victor.stinner@gmail.com>:
2017-07-19 15:22 GMT+02:00 Oleg Broytman <phd@phdru.name>:
On Wed, Jul 19, 2017 at 02:59:52PM +0200, Victor Stinner < victor.stinner@gmail.com> wrote:
"Python is very slow to start on Windows 7" https://stackoverflow.com/questions/29997274/python-is- very-slow-to-start-on-windows-7
However hard you are going to optimize Python you cannot fix those "defenders", "guards" and "protectors". :-) This particular link can be excluded from consideration.
Sorry, I didn't read carefully each link I posted. Even for me, knowing what Python does at startup, it's hard to explain why 3 people get such different timings: 15 ms, 75 ms and 300 ms, for example. In my experience, the following things impact Python startup:

* -S option: loading or not the site module
* Paths in sys.path: the PYTHONPATH environment variable, for example
* .pth files in sys.path
* Python running in a virtual environment or not
* Operating system: Python loads different modules at startup depending on the OS. Naoki INADA just removed _osx_support from being imported in the site module on macOS, for example.

My list is likely incomplete.

In the performance benchmark suite, a controlled virtual environment is created to have a known set of modules. FYI running Python in a virtual environment is slower than a "system" Python which runs outside a virtual environment...
Victor
Hi Victor,
I assume that Python loads compiled modules (.pyc and/or .pyo) from the stdlib. That's something that also influences the startup time (compiling source vs loading pre-compiled modules).

Bests,
Cesare
2017-07-20 19:09 GMT+02:00 Cesare Di Mauro <cesare.di.mauro@gmail.com>:
I assume that Python loads compiled modules (.pyc and/or .pyo) from the stdlib. That's something that also influences the startup time (compiling source vs loading pre-compiled modules).
My benchmark was "python3 -m perf command -- python3 -c pass": I don't explicitly remove .pyc files, I expect that Python uses prebuilt .pyc files from __pycache__. Victor
2017-07-20 19:23 GMT+02:00 Victor Stinner <victor.stinner@gmail.com>:
2017-07-20 19:09 GMT+02:00 Cesare Di Mauro <cesare.di.mauro@gmail.com>:
I assume that Python loads compiled modules (.pyc and/or .pyo) from the stdlib. That's something that also influences the startup time (compiling source vs loading pre-compiled modules).
My benchmark was "python3 -m perf command -- python3 -c pass": I don't explicitly remove .pyc files, I expect that Python uses prebuilt .pyc files from __pycache__.
Victor
OK, that should be the best case.

An idea to improve the situation might be to find an alternative structure for .pyc/.pyo files which would allow (partially) "parallelizing" their loading (not their execution, of course), or at least speeding up the process. Maybe a GSoC project for some student, if no core dev has time to investigate it.

Cesare
On 21 July 2017 at 05:38, Cesare Di Mauro <cesare.di.mauro@gmail.com> wrote:
2017-07-20 19:23 GMT+02:00 Victor Stinner <victor.stinner@gmail.com>:
2017-07-20 19:09 GMT+02:00 Cesare Di Mauro <cesare.di.mauro@gmail.com>:
I assume that Python loads compiled modules (.pyc and/or .pyo) from the stdlib. That's something that also influences the startup time (compiling source vs loading pre-compiled modules).
My benchmark was "python3 -m perf command -- python3 -c pass": I don't explicitly remove .pyc files, I expect that Python uses prebuilt .pyc files from __pycache__.
Victor
OK, that should be the best case.
An idea to improve the situation might be to find an alternative structure for .pyc/.pyo files which would allow (partially) "parallelizing" their loading (not their execution, of course), or at least speeding up the process. Maybe a GSoC project for some student, if no core dev has time to investigate it.
Unmarshalling the code object from disk generally isn't the slow part - it's the module level execution that takes time.

Using the typing module as an example, a full reload cycle takes almost 10 milliseconds:

$ python3 -m perf timeit -s "import typing; from importlib import reload" "reload(typing)"
.....................
Mean +- std dev: 9.89 ms +- 0.46 ms

(Don't try timing "import typing" directly - the sys.modules cache amortises the cost down to being measured in nanoseconds, since you're effectively just measuring the speed of a dict lookup)

We can separately measure the cost of unmarshalling the code object:

$ python3 -m perf timeit -s "import typing; from marshal import loads; from importlib.util import cache_from_source; cache = cache_from_source(typing.__file__); data = open(cache, 'rb').read()[12:]" "loads(data)"
.....................
Mean +- std dev: 286 us +- 4 us

Finding the module spec:

$ python3 -m perf timeit -s "from importlib.util import find_spec" "find_spec('typing')"
.....................
Mean +- std dev: 69.2 us +- 2.3 us

And actually running the module's code (this includes unmarshalling the code object, but *not* calculating the import spec):

$ python3 -m perf timeit -s "import typing; loader_exec = typing.__spec__.loader.exec_module" "loader_exec(typing)"
.....................
Mean +- std dev: 9.68 ms +- 0.43 ms

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
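To see the sys.modules amortisation directly, a repeated import can itself be timed (a quick sketch; per the note above, it should report a mean measured in nanoseconds, i.e. roughly the cost of a dict lookup):

$ python3 -m perf timeit -s "import typing" "import typing"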
On 21 July 2017 at 12:44, Nick Coghlan <ncoghlan@gmail.com> wrote:
We can separately measure the cost of unmarshalling the code object:
$ python3 -m perf timeit -s "import typing; from marshal import loads; from importlib.util import cache_from_source; cache = cache_from_source(typing.__file__); data = open(cache, 'rb').read()[12:]" "loads(data)" ..................... Mean +- std dev: 286 us +- 4 us
Slight adjustment here, as the cost of locating the cached bytecode and reading it from disk should really be accounted for in each iteration:

$ python3 -m perf timeit -s "import typing; from marshal import loads; from importlib.util import cache_from_source" "cache = cache_from_source(typing.__spec__.origin); data = open(cache, 'rb').read()[12:]; loads(data)"
.....................
Mean +- std dev: 337 us +- 8 us

That will have a bigger impact when loading from spinning disk or a network drive, but it's fairly negligible when loading from a local SSD or an already primed filesystem cache.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
2017-07-21 4:52 GMT+02:00 Nick Coghlan <ncoghlan@gmail.com>:
On 21 July 2017 at 12:44, Nick Coghlan <ncoghlan@gmail.com> wrote:
We can separately measure the cost of unmarshalling the code object:
$ python3 -m perf timeit -s "import typing; from marshal import loads; from importlib.util import cache_from_source; cache = cache_from_source(typing.__file__); data = open(cache, 'rb').read()[12:]" "loads(data)" ..................... Mean +- std dev: 286 us +- 4 us
Slight adjustment here, as the cost of locating the cached bytecode and reading it from disk should really be accounted for in each iteration:
$ python3 -m perf timeit -s "import typing; from marshal import loads; from importlib.util import cache_from_source" "cache = cache_from_source(typing.__spec__.origin); data = open(cache, 'rb').read()[12:]; loads(data)" ..................... Mean +- std dev: 337 us +- 8 us
That will have a bigger impact when loading from spinning disk or a network drive, but it's fairly negligible when loading from a local SSD or an already primed filesystem cache.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Thanks for your tests, Nick. It's quite evident that the marshal code cannot improve the situation, so I retract my proposal.

I took a look at the typing module, and there are some small things that can be optimized, but it won't change the overall situation, unfortunately.

Code execution can be improved. :) However, it requires a massive amount of time experimenting...

Bests,
Cesare
On 21 July 2017 at 15:30, Cesare Di Mauro <cesare.di.mauro@gmail.com> wrote:
2017-07-21 4:52 GMT+02:00 Nick Coghlan <ncoghlan@gmail.com>:
On 21 July 2017 at 12:44, Nick Coghlan <ncoghlan@gmail.com> wrote:
We can separately measure the cost of unmarshalling the code object:
$ python3 -m perf timeit -s "import typing; from marshal import loads; from importlib.util import cache_from_source; cache = cache_from_source(typing.__file__); data = open(cache, 'rb').read()[12:]" "loads(data)" ..................... Mean +- std dev: 286 us +- 4 us
Slight adjustment here, as the cost of locating the cached bytecode and reading it from disk should really be accounted for in each iteration:
$ python3 -m perf timeit -s "import typing; from marshal import loads; from importlib.util import cache_from_source" "cache = cache_from_source(typing.__spec__.origin); data = open(cache, 'rb').read()[12:]; loads(data)" ..................... Mean +- std dev: 337 us +- 8 us
That will have a bigger impact when loading from spinning disk or a network drive, but it's fairly negligible when loading from a local SSD or an already primed filesystem cache.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Thanks for your tests, Nick. It's quite evident that the marshal code cannot improve the situation, so I retract my proposal.
It was still a good suggestion, since it made me realise I *hadn't* actually measured the relative timings lately, so it was technically an untested assumption that module level code execution still dominated the overall import time.

typing is also a particularly large & complex module, and bytecode unmarshalling represents a larger fraction of the import time for simpler modules like abc:

$ python3 -m perf timeit -s "import abc; from marshal import loads; from importlib.util import cache_from_source" "cache = cache_from_source(abc.__spec__.origin); data = open(cache, 'rb').read()[12:]; loads(data)"
.....................
Mean +- std dev: 45.2 us +- 1.1 us

$ python3 -m perf timeit -s "import abc; loader_exec = abc.__spec__.loader.exec_module" "loader_exec(abc)"
.....................
Mean +- std dev: 172 us +- 5 us

$ python3 -m perf timeit -s "import abc; from importlib import reload" "reload(abc)"
.....................
Mean +- std dev: 280 us +- 14 us

And _weakrefset:

$ python3 -m perf timeit -s "import _weakrefset; from marshal import loads; from importlib.util import cache_from_source" "cache = cache_from_source(_weakrefset.__spec__.origin); data = open(cache, 'rb').read()[12:]; loads(data)"
.....................
Mean +- std dev: 57.7 us +- 1.3 us

$ python3 -m perf timeit -s "import _weakrefset; loader_exec = _weakrefset.__spec__.loader.exec_module" "loader_exec(_weakrefset)"
.....................
Mean +- std dev: 129 us +- 6 us

$ python3 -m perf timeit -s "import _weakrefset; from importlib import reload" "reload(_weakrefset)"
.....................
Mean +- std dev: 226 us +- 4 us

The conclusion still holds (the absolute numbers here are likely still too small for the extra complexity of parallelising bytecode loading to pay off in any significant way), but it also helps us set reasonable expectations around how much of a gain we're likely to be able to get just from precompilation with Cython.

That does actually raise a small microbenchmarking problem: for source and bytecode imports, we can force the import system to genuinely rerun the module or unmarshal the bytecode inside a single Python process, allowing perf to measure it independently of CPython startup. While I'm pretty sure it's possible to trick the import machinery into rerunning module level init functions even for old-style extension modules (hence allowing us to run similar tests to those above for a Cython compiled module), I don't actually remember how to do it off the top of my head.

Cheers,
Nick.

P.S. I'll also note that in these cases where the import overhead is proportionally significant for always-imported modules, we may want to look at the benefits of freezing them (if they otherwise remain as pure Python modules), or compiling them as builtin modules (if we switch them over to Cython), in addition to looking at ways to make the modules themselves faster. Being built directly into the interpreter binary is pretty much the best case scenario for reducing import overhead.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
How implausible is it to write out the actual memory image of a loaded Python process? I.e. on a specific machine, OS, Python version, etc? This can only be overhead initially, of course, but on subsequent runs it's just one memory map, which is the cheapest possible operation.

E.g.:

$ python3.7 --write-image "import typing, re, os, numpy"

I imagine this creating a file like:

/tmp/__python__/python37-typing-re-os-numpy.mem

Then just terminating as if just that line had run, however long it takes (but snapshotting before exit). Then subsequent invocations would only restore the image to memory. Maybe:

$ pyrunner --load-image python37-typing-re-os-numpy myscript.py

The last line could be aliased, of course. I suppose we'd need to check if the relevant file exists, and if not fall back to just ignoring the '--load-image' flag and running plain old Python.

This helps not at all for something like AWS Lambda where each instance is spun up fresh. But for the use-case of running many Python shell commands at an interactive shell on one machine, it seems like that could be very fast.

In my hypothetical I suppose pre-loading some collection of modules in the image. Of course, the script may need to load others, and it may not use some in the image. But users could decide their typical needed modules themselves under this idea.

On Jul 20, 2017 11:27 PM, "Nick Coghlan" <ncoghlan@gmail.com> wrote:
On 21 July 2017 at 15:30, Cesare Di Mauro <cesare.di.mauro@gmail.com> wrote:
2017-07-21 4:52 GMT+02:00 Nick Coghlan <ncoghlan@gmail.com>:
On 21 July 2017 at 12:44, Nick Coghlan <ncoghlan@gmail.com> wrote:
We can separately measure the cost of unmarshalling the code object:
$ python3 -m perf timeit -s "import typing; from marshal import loads; from importlib.util import cache_from_source; cache = cache_from_source(typing.__file__); data = open(cache, 'rb').read()[12:]" "loads(data)" ..................... Mean +- std dev: 286 us +- 4 us
Slight adjustment here, as the cost of locating the cached bytecode and reading it from disk should really be accounted for in each iteration:
$ python3 -m perf timeit -s "import typing; from marshal import loads; from importlib.util import cache_from_source" "cache = cache_from_source(typing.__spec__.origin); data = open(cache, 'rb').read()[12:]; loads(data)" ..................... Mean +- std dev: 337 us +- 8 us
That will have a bigger impact when loading from spinning disk or a network drive, but it's fairly negligible when loading from a local SSD or an already primed filesystem cache.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Thanks for your tests, Nick. It's quite evident that the marshal code cannot improve the situation, so I retract my proposal.
It was still a good suggestion, since it made me realise I *hadn't* actually measured the relative timings lately, so it was technically an untested assumption that module level code execution still dominated the overall import time.
typing is also a particularly large & complex module, and bytecode unmarshalling represents a larger fraction of the import time for simpler modules like abc:
$ python3 -m perf timeit -s "import abc; from marshal import loads; from importlib.util import cache_from_source" "cache = cache_from_source(abc.__spec__.origin); data = open(cache, 'rb').read()[12:]; loads(data)" ..................... Mean +- std dev: 45.2 us +- 1.1 us
$ python3 -m perf timeit -s "import abc; loader_exec = abc.__spec__.loader.exec_module" "loader_exec(abc)" ..................... Mean +- std dev: 172 us +- 5 us
$ python3 -m perf timeit -s "import abc; from importlib import reload" "reload(abc)" ..................... Mean +- std dev: 280 us +- 14 us
And _weakrefset:
$ python3 -m perf timeit -s "import _weakrefset; from marshal import loads; from importlib.util import cache_from_source" "cache = cache_from_source(_weakrefset.__spec__.origin); data = open(cache, 'rb').read()[12:]; loads(data)" ..................... Mean +- std dev: 57.7 us +- 1.3 us
$ python3 -m perf timeit -s "import _weakrefset; loader_exec = _weakrefset.__spec__.loader.exec_module" "loader_exec(_weakrefset)" ..................... Mean +- std dev: 129 us +- 6 us
$ python3 -m perf timeit -s "import _weakrefset; from importlib import reload" "reload(_weakrefset)" ..................... Mean +- std dev: 226 us +- 4 us
The conclusion still holds (the absolute numbers here are likely still too small for the extra complexity of parallelising bytecode loading to pay off in any significant way), but it also helps us set reasonable expectations around how much of a gain we're likely to be able to get just from precompilation with Cython.
That does actually raise a small microbenchmarking problem: for source and bytecode imports, we can force the import system to genuinely rerun the module or unmarshal the bytecode inside a single Python process, allowing perf to measure it independently of CPython startup. While I'm pretty sure it's possible to trick the import machinery into rerunning module level init functions even for old-style extension modules (hence allowing us to run similar tests to those above for a Cython compiled module), I don't actually remember how to do it off the top of my head.
Cheers, Nick.
P.S. I'll also note that in these cases where the import overhead is proportionally significant for always-imported modules, we may want to look at the benefits of freezing them (if they otherwise remain as pure Python modules), or compiling them as builtin modules (if we switch them over to Cython), in addition to looking at ways to make the modules themselves faster. Being built directly into the interpreter binary is pretty much the best case scenario for reducing import overhead.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Fri, 21 Jul 2017 00:12:20 -0700 David Mertz <mertz@gnosis.cx> wrote:
How implausible is it to write out the actual memory image of a loaded Python process? I.e. on a specific machine, OS, Python version, etc? This can only be overhead initially, of course, but on subsequent runs it's just one memory map, which is the cheapest possible operation.
You can't rely on the file being remapped at the same address when you reload it. So you'd have to write a relocation routine that's able to find and fix *all* pointers inside the Python object tree and CPython's internal structures (fixing the pointers is not necessarily difficult, finding them without missing any is the difficult part). Regards Antoine.
On Fri, Jul 21, 2017 at 4:12 PM, David Mertz <mertz@gnosis.cx> wrote:
How implausible is it to write out the actual memory image of a loaded Python process? I.e. on a specific machine, OS, Python version, etc? This can only be overhead initially, of course, but on subsequent runs it's just one memory map, which is the cheapest possible operation.
FYI, you may be interested in a very recent node.js security issue. https://nodejs.org/en/blog/vulnerability/july-2017-security-releases/#node-j...
On Jul 21 2017, David Mertz <mertz@gnosis.cx> wrote:
How implausible is it to write out the actual memory image of a loaded Python process?
That is what Emacs does, and it causes them a lot of trouble. They're trying to move away from it at the moment, but the direction is not yet clear. The keyword is "unexec", and it wreaks havoc with malloc.

Best,
-Nikolaus

--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«
On Jul 21, 2017, at 01:25 PM, Nikolaus Rath wrote:
That is what Emacs does, and it causes them a lot of trouble. They're trying to move away from it at the moment, but the direction is not yet clear. The keyword is "unexec", and it wreaks havoc with malloc.
Emacs has been unexec'ing for as long as I can remember (which is longer than I can remember Python :). I know that it's been problematic and there have been many efforts over the years to replace it, but I think it's been a fairly successful technique in practice, at least on platforms that support it. That's another problem with the approach of course; it's not universally possible to implement. -Barry
Emacs has been unexec'ing for as long as I can remember (which is longer than I can remember Python :). I know that it's been problematic and there have been many efforts over the years to replace it, but I think it's been a fairly successful technique in practice, at least on platforms that support it.

I've been using Emacs far longer than Python. I remember having to invoke temacs on something. Still, if I didn't know better, I could be convinced you were referring to the GIL. :-)

Skip
I would guess that Windows users don't tend to run lots of command line tools where startup time dominates, as *nix users do. On Fri, Jul 21, 2017 at 3:21 PM, Barry Warsaw <barry@python.org> wrote:
On Jul 21, 2017, at 01:25 PM, Nikolaus Rath wrote:
That is what Emacs does, and it causes them a lot of trouble. They're trying to move away from it at the moment, but the direction is not yet clear. The keyword is "unexec", and it wreaks havoc with malloc.
Emacs has been unexec'ing for as long as I can remember (which is longer than I can remember Python :). I know that it's been problematic and there have been many efforts over the years to replace it, but I think it's been a fairly successful technique in practice, at least on platforms that support it. That's another problem with the approach of course; it's not universally possible to implement.
-Barry
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On 21 July 2017 at 23:53, David Mertz <mertz@gnosis.cx> wrote:
I would guess that Windows users don't tend to run lots of command line tools where startup time dominates, as *nix users do.
Well, in the sense that many Windows users don't use the command line at all, this is true. However, startup time is a definite problem for Windows users who *do* use the command line, because process creation cost is a lot higher than on Unix, so starting new commands is *already* costly, and therefore minimising additional overhead is crucial. It's a bit of a chicken and egg problem - Windows users avoid excessive command line program invocation because startup time is high, so no-one optimises startup time because Windows users don't use short-lived command line programs. But I'm seeing a trend away from that - more and more Windows tools these days seem to be comfortable spawning subprocesses. I don't know what prompted that trend. Paul
-----Original Message-----
From: Python-Dev [mailto:python-dev-bounces+tritium-list=sdamon.com@python.org] On Behalf Of Paul Moore
Sent: Saturday, July 22, 2017 4:14 AM
To: David Mertz <mertz@gnosis.cx>
Cc: Barry Warsaw <barry@python.org>; Python-Dev <python-dev@python.org>
Subject: Re: [Python-Dev] Python startup time
It's a bit of a chicken and egg problem - Windows users avoid excessive command line program invocation because startup time is high, so no-one optimises startup time because Windows users don't use short-lived command line programs. But I'm seeing a trend away from that - more and more Windows tools these days seem to be comfortable spawning subprocesses. I don't know what prompted that trend.
The programs I see that are comfortable spawning processes willy-nilly on windows are mostly .net, which has a lot of the runtime assemblies cached by the OS in the GAC - if you are spawning a second process of yourself, or something that uses the same libraries as you, the compile step on those can be skipped. Unless you are talking about python/non-.NET programs, in which case, I have no answer.
Paul
I believe the trend is due to languages like Python and Node.js, most of which aggressively discourage threading (more from the broader community than the core languages, but I see a lot of apps using these now), and also the higher reliability afforded by out-of-process tasks (that is, one crash doesn’t kill the entire app – e.g. browser tabs).

Optimizing startup time is incredibly valuable, and having tried it a few times I believe that the import system (in essence, stat calls) is the biggest culprit. The tens of ms prior to the first user import can’t really go anywhere.

Cheers,
Steve

Top-posted from my Windows phone

From: Alex Walters
Sent: Saturday, July 22, 2017 1:39
Cc: 'Python-Dev'
Subject: Re: [Python-Dev] Python startup time
-----Original Message----- From: Python-Dev [mailto:python-dev-bounces+tritium- list=sdamon.com@python.org] On Behalf Of Paul Moore Sent: Saturday, July 22, 2017 4:14 AM To: David Mertz <mertz@gnosis.cx> Cc: Barry Warsaw <barry@python.org>; Python-Dev <python- dev@python.org> Subject: Re: [Python-Dev] Python startup time
It's a bit of a chicken and egg problem - Windows users avoid excessive command line program invocation because startup time is high, so no-one optimises startup time because Windows users don't use short-lived command line programs. But I'm seeing a trend away from that - more and more Windows tools these days seem to be comfortable spawning subprocesses. I don't know what prompted that trend.
The programs I see that are comfortable spawning processes willy-nilly on windows are mostly .net, which has a lot of the runtime assemblies cached by the OS in the GAC - if you are spawning a second process of yourself, or something that uses the same libraries as you, the compile step on those can be skipped. Unless you are talking about python/non-.NET programs, in which case, I have no answer.
Paul
On Sat, Jul 22, 2017, 07:22 Steve Dower, <steve.dower@python.org> wrote:
I believe the trend is due to language like Python and Node.js, most of which aggressively discourage threading (more from the broader community than the core languages, but I see a lot of apps using these now), and also the higher reliability afforded by out-of-process tasks (that is, one crash doesn’t kill the entire app – e.g browser tabs).
Optimizing startup time is incredibly valuable, and having tried it a few times I believe that the import system (in essence, stat calls) is the biggest culprit. The tens of ms prior to the first user import can’t really go anywhere.
Stat calls in the import system were optimized in importlib a while back to be cached in finders, so at this point you will have to remove a stat call to lower that cost, or cache more, which goes into breaking abstractions or designing new APIs.

-brett
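The finder caching is visible from Python itself: per-path finders are kept in sys.path_importer_cache, and importlib.invalidate_caches() forces their cached directory listings to be refreshed (a quick inspection sketch; the printed paths differ per machine):

$ python3 -c "import sys; print(list(sys.path_importer_cache)[:5])"
$ python3 -c "import importlib; importlib.invalidate_caches()"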
Cheers,
Steve
Top-posted from my Windows phone
*From: *Alex Walters <tritium-list@sdamon.com> *Sent: *Saturday, July 22, 2017 1:39 *Cc: *'Python-Dev' <python-dev@python.org>
*Subject: *Re: [Python-Dev] Python startup time
-----Original Message-----
From: Python-Dev [mailto:python-dev-bounces+tritium-list=sdamon.com@python.org] On Behalf Of Paul Moore
Sent: Saturday, July 22, 2017 4:14 AM
To: David Mertz <mertz@gnosis.cx>
Cc: Barry Warsaw <barry@python.org>; Python-Dev <python-dev@python.org>
Subject: Re: [Python-Dev] Python startup time

It's a bit of a chicken and egg problem - Windows users avoid excessive command line program invocation because startup time is high, so no-one optimises startup time because Windows users don't use short-lived command line programs. But I'm seeing a trend away from that - more and more Windows tools these days seem to be comfortable spawning subprocesses. I don't know what prompted that trend.

The programs I see that are comfortable spawning processes willy-nilly on windows are mostly .net, which has a lot of the runtime assemblies cached by the OS in the GAC - if you are spawning a second process of yourself, or something that uses the same libraries as you, the compile step on those can be skipped. Unless you are talking about python/non-.NET programs, in which case, I have no answer.

Paul
“Stat calls in the import system were optimized in importlib a while back”

Yes, I’m aware of that, which is why I don’t have any specific suggestions off-hand. But given the differences in file systems between Windows and other OSs, it wouldn’t surprise me if there were a more optimal approach for NTFS to amortize calls better. Perhaps not, but it is still the most expensive part of startup that we have any ability to change, so it’s worth investigating.

Cheers,
Steve

Top-posted from my Windows phone

From: Brett Cannon
Sent: Saturday, July 22, 2017 10:18
To: Steve Dower; Alex Walters
Cc: Python-Dev
Subject: Re: [Python-Dev] Python startup time
On Sat, 22 Jul 2017 16:35:31 -0700 Steve Dower <steve.dower@python.org> wrote:
Yes, I’m aware of that, which is why I don’t have any specific suggestions off-hand. But given the differences in file systems between Windows and other OSs, it wouldn’t surprise me if there were a more optimal approach for NTFS to amortize calls better. Perhaps not, but it is still the most expensive part of startup that we have any ability to change, so it’s worth investigating.
Can you expand on it being "the most expensive part of startup that we have any ability to change"? For example, how do Nick's benchmarks above fare on Windows? Regards Antoine.
On 23 July 2017 at 09:35, Steve Dower <steve.dower@python.org> wrote:
Yes, I’m aware of that, which is why I don’t have any specific suggestions off-hand. But given the differences in file systems between Windows and other OSs, it wouldn’t surprise me if there were a more optimal approach for NTFS to amortize calls better. Perhaps not, but it is still the most expensive part of startup that we have any ability to change, so it’s worth investigating.
That does remind me of a capability we haven't played with a lot recently:

$ python3 -m site
sys.path = [
    '/home/ncoghlan',
    '/usr/lib64/python36.zip',
    '/usr/lib64/python3.6',
    '/usr/lib64/python3.6/lib-dynload',
    '/home/ncoghlan/.local/lib/python3.6/site-packages',
    '/usr/lib64/python3.6/site-packages',
    '/usr/lib/python3.6/site-packages',
]
USER_BASE: '/home/ncoghlan/.local' (exists)
USER_SITE: '/home/ncoghlan/.local/lib/python3.6/site-packages' (exists)
ENABLE_USER_SITE: True

The interpreter puts a zip file ahead of the regular unpacked standard library on sys.path because at one point in time that was a useful optimisation technique for reducing import costs on application startup. It was a potentially big win with the old "multiple stat calls" import implementation, but I'm not aware of any more recent benchmarks relative to the current listdir-caching based import implementation.

So I think some interesting experiments to try measuring might be:

- pushing the "always imported" modules into a dedicated zip archive
- having the interpreter pre-seed sys.modules with the contents of that dedicated archive
- freezing those modules and building them into the interpreter that way
- compiling the standalone top-level modules with Cython, and loading them as extension modules
- compiling in the Cython generated modules as builtins (not currently an option for packages & submodules due to [1])

The nice thing about those kinds of approaches is that they're all fairly general purpose, and relate primarily to how the Python interpreter is put together, rather than how the individual modules are written in the first place.

(I'm not volunteering to run those experiments, though - just pointing out some of the technical options we have available to us that don't involve adding more handcrafted C extension modules to CPython)

[1] https://bugs.python.org/issue1644818

Cheers,
Nick.

P.S. Checking the current list of source modules implicitly loaded at startup, I get:
>>> import sys
>>> sorted(k for k, m in sys.modules.items()
...        if m.__spec__ is not None
...        and type(m.__spec__.loader).__name__ == "SourceFileLoader")
['_collections_abc', '_sitebuiltins', '_weakrefset', 'abc', 'codecs',
 'encodings', 'encodings.aliases', 'encodings.latin_1', 'encodings.utf_8',
 'genericpath', 'io', 'os', 'os.path', 'posixpath', 'rlcompleter', 'site', 'stat']
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
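A rough sketch of the first experiment on that list (bundling always-imported modules into a zip archive placed ahead of the unpacked stdlib; the module selection, paths and archive name here are illustrative only, and zipimport can serve both .py and .pyc files from such an archive):

$ cd Lib && python3 -m zipfile -c /tmp/startup_modules.zip abc.py codecs.py io.py os.py stat.py
$ PYTHONPATH=/tmp/startup_modules.zip python3 -m perf command -- python3 -c pass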
Optimizing startup time is incredibly valuable,
I've been reading that from the beginning of this thread, but I've been using Python since 2.4 and I never felt the burden of the startup time.

I'm guessing a lot of people are like me; they just don't express themselves because "better startup time can't be bad, so let's not put a barrier on this".

I'm not against it, but since the necessity of a faster Python in general has been a debate for years and is only finally catching up with the work of Victor Stinner, can somebody explain to me the deal with startup time?

I understand where it can improve your lives. I just don't get why it's suddenly such an explosion of expectations and needs.
On Sun, Jul 23, 2017, 00:53 Michel Desmoulin, <desmoulinmichel@gmail.com> wrote:
Optimizing startup time is incredibly valuable,
I've been reading that from the beginning of this thread, but I've been using Python since 2.4 and I never felt the burden of the startup time.

I'm guessing a lot of people are like me; they just don't express themselves because "better startup time can't be bad, so let's not put a barrier on this".

I'm not against it, but since the necessity of a faster Python in general has been a debate for years and is only finally catching up with the work of Victor Stinner, can somebody explain to me the deal with startup time?

I understand where it can improve your lives. I just don't get why it's suddenly such an explosion of expectations and needs.
It's actually always been something we have tried to improve; it just comes in waves. For instance we occasionally re-examine what modules get pulled in during startup. Importlib was optimized to help with startup. This just happens to be the latest round of trying to improve the situation.

As for why we care, every command-line app wants to at least appear faster if not be faster, because just getting to the point of being able to e.g. print a version number is dominated by Python and app start-up. And this is not guessing; I work with a team that puts out a command line app, and one of the biggest complaints they get is the startup time.

-brett
Le 23/07/2017 à 19:36, Brett Cannon a écrit :
On Sun, Jul 23, 2017, 00:53 Michel Desmoulin, <desmoulinmichel@gmail.com <mailto:desmoulinmichel@gmail.com>> wrote:
> Optimizing startup time is incredibly valuable,
I've been reading that from the beginning of this thread, but I've been using Python since 2.4 and I never felt the burden of the startup time.

I'm guessing a lot of people are like me; they just don't express themselves because "better startup time can't be bad, so let's not put a barrier on this".

I'm not against it, but since the necessity of a faster Python in general has been a debate for years and is only finally catching up with the work of Victor Stinner, can somebody explain to me the deal with startup time?

I understand where it can improve your lives. I just don't get why it's suddenly such an explosion of expectations and needs.
It's actually always been something we have tried to improve; it just comes in waves. For instance we occasionally re-examine what modules get pulled in during startup. Importlib was optimized to help with startup. This just happens to be the latest round of trying to improve the situation.
As for why we care, every command-line app wants to at least appear faster if not be faster because just getting to the point of being able to e.g. print a version number is dominated by Python and app start-up.
Fair enough.
And this is not guessing; I work with a team that puts out a command line app and one of the biggest complaints they get is the startup time.
This I don't get. When I run any command line utility in Python (grin, ffind, pyped, django-admin.py...), they execute in a split second.

I can't even SEE the difference between:

python3 -c "import os; [print(x) for x in os.listdir('.')]"

and

ls .

I'm having a hard time understanding how the Python VM startup time can be perceived as a barrier here. I can understand if you have an application firing Python 1000 times a second, like a CGI service or some kind of code exec service. But scripting?

Now I can imagine that a given Python program can be slow to start up, because it imports a lot of things. But not the VM itself.
-brett
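The difference is easy to measure directly, though the absolute numbers are machine-dependent (a rough sketch using the shell's time builtin; output redirected so the listing cost doesn't dominate):

$ time python3 -c "import os; [print(x) for x in os.listdir('.')]" > /dev/null
$ time ls > /dev/null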
On Sun, Jul 23, 2017, 10:52 Michel Desmoulin, <desmoulinmichel@gmail.com> wrote:
Le 23/07/2017 à 19:36, Brett Cannon a écrit :
On Sun, Jul 23, 2017, 00:53 Michel Desmoulin, <desmoulinmichel@gmail.com <mailto:desmoulinmichel@gmail.com>> wrote:
> Optimizing startup time is incredibly valuable,
I've been reading that from the beginning of this thread, but I've been using Python since 2.4 and I never felt the burden of the startup time.

I'm guessing a lot of people are like me; they just don't express themselves because "better startup time can't be bad, so let's not put a barrier on this".

I'm not against it, but since the necessity of a faster Python in general has been a debate for years and is only finally catching up with the work of Victor Stinner, can somebody explain to me the deal with startup time?

I understand where it can improve your lives. I just don't get why it's suddenly such an explosion of expectations and needs.
It's actually always been something we have tried to improve; it just comes in waves. For instance we occasionally re-examine what modules get pulled in during startup. Importlib was optimized to help with startup. This just happens to be the latest round of trying to improve the situation.
As for why we care, every command-line app wants to at least appear faster if not be faster because just getting to the point of being able to e.g. print a version number is dominated by Python and app start-up.
Fair enough.
And this is not guessing; I work with a team that puts out a command line app and one of the biggest complaints they get is the startup time.
This I don't get. When I run any command line utility in Python (grin, ffind, pyped, django-admin.py...), they execute in a split second.

I can't even SEE the difference between:

python3 -c "import os; [print(x) for x in os.listdir('.')]"

and

ls .

I'm having a hard time understanding how the Python VM startup time can be perceived as a barrier here. I can understand if you have an application firing Python 1000 times a second, like a CGI service or some kind of code exec service. But scripting?
So you're viewing it from a single OS and single machine perspective. Stuff varies so much that you can't compare something like this based on a single experience. I also said "appear" on purpose. 😉 Some people just compare Python against other languages based on benchmarks like startup when choosing a language so part of this is optics. This also applies when people compare Python 2 to 3.
Now I can imagine that a given Python program can be slow to start up, because it imports a lot of things. But not the VM itself.
There's also the fact that some things we might do to speed up Python's own startup will propagate to user code and so have a bigger effect, e.g. making namedtuple cheaper reaches into user code that uses namedtuple. IOW based on experience this is worth the time to look into.
-brett
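The namedtuple cost mentioned above is straightforward to measure with the same perf tooling used throughout the thread (a quick sketch; the class name and fields are arbitrary):

$ python3 -m perf timeit -s "from collections import namedtuple" "namedtuple('Point', ['x', 'y'])"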
Nick Coghlan wrote on 21.07.2017 at 08:23:
I'll also note that in these cases where the import overhead is proportionally significant for always-imported modules, we may want to look at the benefits of freezing them (if they otherwise remain as pure Python modules), or compiling them as builtin modules (if we switch them over to Cython), in addition to looking at ways to make the modules themselves faster.
Just for the sake of it, I gave the Cython compilation a try. I had to apply the attached hack to Lib/typing.py to get the tests passing, because it uses frame call offsets in some places and Cython functions do not create frames when being called (they only create them for exception traces). I also had to disable the import of "abc" in the Cython generated module to remove the circular self dependency at startup when the "abc" module is compiled. That shouldn't have an impact on the runtime performance, though.

Note that this is otherwise using the unmodified Python code, as provided in the current modules, constructing and using normal Python classes for everything, no extension types etc. Only two stdlib Python modules were compiled into shared libraries, and not statically linked into the CPython core.

I used the "python_startup" benchmark in the "performance" project to measure the overall startup times of a clean non-debug non-pgo build of CPython 3.7 (rev d0969d6) against the same build with a compiled typing.py and abc.py. To compile these modules, I used the following command (plus the attached patch):

$ cythonize -3 -X binding=True -i Lib/typing.py Lib/abc.py

I modified the startup benchmark to run "python -c 'import typing'" etc. instead of just executing "pass".

- stock CPython starting up and running "pass":
  Mean +- std dev: 14.7 ms +- 0.3 ms

- stock CPython starting up and running "import abc":
  Mean +- std dev: 14.8 ms +- 0.3 ms

- with compiled abc.py:
  Mean +- std dev: 14.9 ms +- 0.3 ms

- stock CPython starting up and running "import typing":
  Mean +- std dev: 34.6 ms +- 1.0 ms

- with compiled abc.py:
  Mean +- std dev: 34.4 ms +- 0.6 ms

- with compiled typing.py:
  Mean +- std dev: 33.5 ms +- 0.7 ms

- with both compiled:
  Mean +- std dev: 33.1 ms +- 0.4 ms

That's only a 4% improvement in the overall startup time on my machine, and about a 7% faster overall runtime of "import typing" compared to "pass". Note also that compiling abc.py leads to a slightly *increased* startup time in the "import abc" case, which might be due to the larger file size of the abc.so file compared to the abc.pyc file. This is amortised by the decreased runtime in the "import typing" case (I guess).

I then ran the test suites for both modules, for lack of a better post-startup runtime benchmark. The improvement for abc.py is in the order of 1-2%, but test_typing.py has many more tests and wins about 13% overall:

- stock CPython executing essentially "runner.run(deepcopy(suite))" in "test_typing.py" (the deepcopy() takes about 6 ms):
  Mean +- std dev: 68.6 ms +- 0.8 ms

- compiled abc.py and typing.py:
  Mean +- std dev: 60.7 ms +- 0.7 ms

One more thing to note: the compiled modules are quite large. I get these file sizes:

   8658  Lib/abc.py
   7525  Lib/__pycache__/abc.cpython-37.pyc
 369930  Lib/abc.c
 122048  Lib/abc.cpython-37m-x86_64-linux-gnu.so

  80290  Lib/typing.py
  73921  Lib/__pycache__/typing.cpython-37.pyc
2951893  Lib/typing.c
1182632  Lib/typing.cpython-37m-x86_64-linux-gnu.so

The .so files are about 16x as large as the .pyc files. The typing.so file weighs in at about 40% of the size of the stripped python binary:

2889136  python

As it stands, the gain is probably not worth the increase in library file size, which also translates to a higher bottom line for the memory consumption. At least not for these two modules. Manually optimising the files would likely also reduce the .so file size in addition to giving better speedups, though, because the generated code would become less generic.

Stefan
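The modified startup benchmark boils down to timing an import through a fresh interpreter; a minimal equivalent using perf's command runner (a sketch only, not the exact harness from the performance project):

$ python3 -m perf command -- python3 -c "import typing"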
On 22 July 2017 at 06:43, Stefan Behnel <stefan_ml@behnel.de> wrote:
Nick Coghlan wrote on 21.07.2017 at 08:23:
I'll also note that in these cases where the import overhead is proportionally significant for always-imported modules, we may want to look at the benefits of freezing them (if they otherwise remain as pure Python modules), or compiling them as builtin modules (if we switch them over to Cython), in addition to looking at ways to make the modules themselves faster.
Just for the sake of it, I gave the Cython compilation a try. I had to apply the attached hack to Lib/typing.py to get the test passing, because it uses frame call offsets in some places and Cython functions do not create frames when being called (they only create them for exception traces). I also had to disable the import of "abc" in the Cython generated module to remove the circular self dependency at startup when the "abc" module is compiled. That shouldn't have an impact on the runtime performance, though.
[snip]
As it stands, the gain is probably not worth the increase in library file size, which also translates to a higher bottom line for the memory consumption. At least not for these two modules. Manually optimising the files would likely also reduce the .so file size in addition to giving better speedups, though, because the generated code would become less generic.
Thanks for trying the experiment! I agree with your conclusion that the file size impact likely rules it out as a general technique. Selective freezing may still be interesting though, since that at least avoids the import path searches and merges the disk read into the initial loading of the executable. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 19 July 2017 at 22:59, Victor Stinner <victor.stinner@gmail.com> wrote:
== CPython core developers don't care? no, they do care ==
Christian Heimes, Naoki INADA, Serhiy Storchaka, Yury Selivanov, me (Victor Stinner) and other core developers have made multiple changes over the last few years to reduce the number of imports at startup, optimize importlib, etc.
I actually also care myself, since interpreter startup time feeds directly into cost of execution when running in environments like AWS Lambda which charge by the "gigabyte second" (i.e. you allocate a certain amount of RAM to a particular command, and then get charged for that RAM for the amount of time it takes to run, as measured with subsecond precision - if you exceed the limits of the free tier, anything you're losing to language runtime startup in such an environment translates almost directly to higher costs).

In aggregate, shaving time off CPython startup saves *scary* amounts of collective compute time around the world - even though most runtime environments don't track that as closely in financial terms as Lambda does, we're still nudging the power & cooling requirements of data centers slightly higher than they would otherwise be. So even when the per-invocation impact of a performance improvement is small, it's worth keeping in mind that CPython gets invoked a *lot*, whether it's to respond to a web request, run a test, run a build, deploy another application, analyse some data, etc :)

However, I'm also of the view that module & API maintainers *do* have the authority to set the design priorities for the parts of the standard library that they're personally responsible for, and if we'd like them to change their minds based on information we have that they don't, then reopening enhancement requests that they already closed is *not* the way to go about it (as while the issue tracker is an excellent venue for figuring out the technical details of a change, or deciding whether or not an RFE is a good idea given a common understanding of the relevant design priorities, it's almost always a *terrible* venue for resolving outright disagreements as to what the most relevant design priorities actually are).

Rather, the best available way to publicly request reconsideration is the way Antoine did when he escalated the namedtuple question to python-dev: by explicitly acknowledging that there's a conflict in design priorities between core developers, and asking for a collective discussion (and potentially a determination from Guido) as to the right way forward for the project as a whole.

Cheers, Nick.

P.S. I'll also note that we're not *actually* limited to resolving such conflicts in public venues (even though I think that's a good default habit for us to retain): as long as we report the outcome of any mutual agreements about design priorities back to the relevant public venue (e.g. a tracker issue), there's nothing wrong with shifting our attempts to better understand each other's perspectives to private email, IRC, video chat, etc. A non-trivial number of previously vociferous arguments have been resolved amicably once the main parties involved have had a chance to discuss them in person at a conference or sprint. It can even make sense to reach out to other core devs for help, since it's almost always easier for someone not caught in the midst of an argument to see both sides of it, and potentially spot a core of agreement amidst various surface level disagreements :)

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 7/19/2017 10:05 AM, Nick Coghlan wrote:
P.S. I'll also note that we're not *actually* limited to resolving such conflicts in public venues (even though I think that's a good default habit for us to retain): as long as we report the outcome of any mutual agreements about design priorities back to the relevant public venue (e.g. a tracker issue), there's nothing wrong with shifting our attempts to better understand each other's perspectives to private email, IRC, video chat, etc.
I expect and hope that there will be discussion of this issue at the core developer sprint in September, with summary reports back here on pydev.
It can even make sense to reach out to other core devs for help, since it's almost always easier for someone not caught in the midst of an argument to see both sides of it, and potentially spot a core of agreement amidst various surface level disagreements :)
I always understood the Python development process, both for core and users, to be "Make it right; then make it faster", with the second clause conditioned on 'while keeping it right' and, especially for core development, 'if significantly slow'. (People can rightly work on the speed of personal code for other reasons.)

I believe we pretty much agree on the principles. The disagreement seems to be on whether a particular case is 'significantly slow'. I believe that the burden of proof is with those who propose a change. The burden of the proof depends on the final qualification: 'without adding unnecessary or extreme complexity'. If there is no added complication, the burden is slight. If not, we will likely disagree about complexity and its tradeoff with speed.

About 'keeping it right': It has been mentioned that more complicated code *generally* makes it harder to 'see' that the code is (basically) correct. The second line of defense is the automated test suite. I think, for instance, that someone interested in changing namedtuple (to a faster and presumably more complicated implementation) should check the coverage of the current code, with branches checked both ways. Then, bring the coverage up to 100% if it is not already, and carefully check the tests for possible missing cases. A small static set of test cases cannot cover everything.

The third test of an implementation is accumulated user experience. A new implementation starts at 0. One way to increase that is to test the implementation with 3rd-party code. Another, I think, is through randomized testing.

Proposal 1: Depending on our confidence in a new implementation, simulate user experience with randomized tests, perhaps running for hours. Example: we develop a random (unicode) identifier generator that starts with any of the legal initial codepoints and continues with a random number of legal follow codepoints. Then test (old) and new namedtuple with random class names and a random number of random field names. (A sketch of this idea follows below.) A developer could also use third-party packages, like hypothesis. Code and a summary could be uploaded to bpo. A summary could even go in the code file.

Note 1: Tim Peters did something like this when developing timsort. He provided a nice summary of test cases and time results.

Note 2: Randomized tests require that either a) randomized inputs are verified by property or predicate, rather than by hard-coded values, or b) inputs are generated from outputs, where either the output or inverse generation are randomized. Tests of sorting can use either is_sorted(list(sorted(random_input))) or list(sorted(random_shuffle(output))) == output.

Proposal 2: Add randomized tests here and there in the test suite. Each randomized test x 30 buildbots x 2 runs/day x 365 days/year is about 22000 random inputs a year. Since each buildbot would be running a slightly different test, we need to act on and not ignore sporadic failures. Victor Stinner's buildbot work is making this feasible.

-- Terry Jan Reedy
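A minimal sketch of Proposal 1 for namedtuple, restricted to ASCII identifiers for brevity (the full proposal would draw initial and follow characters from the whole Unicode identifier space); all helper names here are illustrative:

import random
import string
from collections import namedtuple
from keyword import iskeyword

def random_identifier(rng, max_len=12):
    # A legal initial character, then a random number of legal follow
    # characters; namedtuple rejects keywords and leading underscores,
    # so those are filtered out here.
    first = rng.choice(string.ascii_letters)
    rest = "".join(rng.choice(string.ascii_letters + string.digits + "_")
                   for _ in range(rng.randrange(max_len)))
    name = first + rest
    return None if iskeyword(name) else name

rng = random.Random(20170719)  # seeded, so any failure is reproducible
for _ in range(1000):
    fields = []
    for _ in range(rng.randint(1, 8)):
        name = random_identifier(rng)
        if name and name not in fields:
            fields.append(name)
    if not fields:
        continue
    T = namedtuple(random_identifier(rng) or "T", fields)
    t = T(*range(len(fields)))
    # Property-based check: attribute access must agree with tuple order.
    assert [getattr(t, f) for f in T._fields] == list(t)

Checking the result by property (field access agrees with positional values) rather than against hard-coded expected values is what lets the inputs be random, per Note 2 above.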
I agree the start-up time is important. There is something that is related. ABCMeta is currently implemented in Python. This makes it slow, creation of an ABC is 2x slower than creation of a normal class. However, ABCs are used by many medium and large size projects.

Also, both abc and _collections_abc are imported at start-up (in particular, importlib uses several ABCs, and os also needs them for environments). Finally, all generics in the typing module and user-defined generic types are ABCs (to allow interoperability with collections.abc).

My idea is to re-implement ABCMeta (and the ingredients it depends on, like WeakSet) in C. I didn't find such a proposal on b.p.o., so I have two questions:

* Are there some potential problems with this idea (except that it may take some time and effort)?
* Is it something worth doing as an optimization?

(If the answers are no and yes, then maybe I would spend part of my vacation in August on it.)

-- Ivan
Hi, Ivan. First of all, Yes, please do it! On Thu, Jul 20, 2017 at 8:24 PM, Ivan Levkivskyi <levkivskyi@gmail.com> wrote:
I agree the start-up time is important. There is something that is related. ABCMeta is currently implemented in Python. This makes it slow, creation of an ABC is 2x slower than creation of a normal class.
Additionally, ABC "infects" by inheritance. When people use a mix-in provided by collections.abc, the class becomes an ABC even if it is a concrete class. There is no documented/recommended way to inherit from an ABC class but not use ABCMeta.
However, ABCs are used by many medium and large size projects.
Many people coming from other language backgrounds use ABCs like Java's interfaces or abstract classes. So it may be worth having just Abstract, but not ABC. See https://mail.python.org/pipermail/python-ideas/2017-July/046495.html
Also, both abc and _collections_abc are imported at start-up (in particular importlib uses several ABCs, os also needs them for environments). Finally, all generics in typing module and user-defined generic types are ABCs (to allow interoperability with collections.abc).
Yes. Even if site.py doesn't use typing, many applications and libraries will start using typing. And it's much slower than collections.abc.
My idea is to re-implement ABCMeta (and ingredients it depends on, like WeakSet) in C. I didn't find such proposal on b.p.o., I have two questions: * Are there some potential problems with this idea (except that it may take some time and effort)?
WeakSet needs special care. Maybe ABCMeta can be optimized first. Currently, ABCMeta uses three WeakSets, but their creation can be delayed until `register` or `issubclass` is called. So even if WeakSet stays implemented in Python, I think ABCMeta can be made much faster.
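A hedged sketch of that deferred-creation idea (this is not abc.py's actual code, and ABCMeta's real bookkeeping is more involved; names here are illustrative):

import weakref

class LazyABCMeta(type):
    # The real ABCMeta builds its WeakSets in __new__; this sketch defers
    # registry creation until register() is actually called, so classes
    # that never register anything pay nothing.
    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        cls._abc_registry = None  # created on demand
        return cls

    def register(cls, subclass):
        if not isinstance(subclass, type):
            raise TypeError("Can only register classes")
        if cls._abc_registry is None:
            cls._abc_registry = weakref.WeakSet()
        cls._abc_registry.add(subclass)
        return subclass

class MyABC(metaclass=LazyABCMeta):
    pass

MyABC.register(int)  # the WeakSet only comes into existence here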
* Is it something worth doing as an optimization? (If answers are no and yes, then maybe I would spend part of my vacation in August on it.)
-- Ivan
Bests,
On Thu, 20 Jul 2017 21:29:18 +0900 INADA Naoki <songofacandy@gmail.com> wrote:
WeakSet needs special care. Maybe ABCMeta can be optimized first.
Currently, ABCMeta uses three WeakSets, but their creation can be delayed until `register` or `issubclass` is called. So even if WeakSet stays implemented in Python, I think ABCMeta can be made much faster.
Simple uses of WeakSet can probably be replaced with regular sets + weakref callbacks. As long as you are not doing one of the delicate things (such as iterating), it should be fine. Regards Antoine.
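For illustration, a minimal sketch of that replacement, assuming only add/contains-style usage and no iteration:

import weakref

class CheapWeakRegistry:
    """A plain set of weakrefs; a callback discards entries as they die."""

    def __init__(self):
        self._refs = set()

    def add(self, obj):
        # weakref callbacks receive the dead weakref itself, so set.discard
        # can be passed directly - no WeakSet bookkeeping machinery needed.
        self._refs.add(weakref.ref(obj, self._refs.discard))

    def __contains__(self, obj):
        # weakrefs to the same live object hash and compare equal, so a
        # fresh ref works as a membership probe.
        try:
            return weakref.ref(obj) in self._refs
        except TypeError:  # not weakref-able or not hashable
            return False

The delicate part Antoine mentions is iteration: a WeakSet guards against entries dying mid-iteration, while this sketch deliberately does not.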
Ivan Levkivskyi schrieb am 20.07.2017 um 13:24:
I agree the start-up time is important. There is something that is related. ABCMeta is currently implemented in Python. This makes it slow, creation of an ABC is 2x slower than creation of a normal class. However, ABCs are used by many medium and large size projects. Also, both abc and _collections_abc are imported at start-up (in particular importlib uses several ABCs, os also needs them for environments). Finally, all generics in typing module and user-defined generic types are ABCs (to allow interoperability with collections.abc).
My idea is to re-implement ABCMeta (and ingredients it depends on, like WeakSet) in C.
I know that this hasn't really been an accepted option so far (and it's actually not an option for a few really early modules during startup), but compiling a Python module with Cython will usually speed it up quite noticeably (often 10-30%, sometimes more if you're lucky, e.g. [1]). And that also applies to the startup time, simply because it's pre-compiled.

So, before considering writing an accelerator module in C that replaces some existing Python module, and thus duplicating its entire source code with highly increased complexity, I'd like to remind you that simply compiling the Python module itself to C should give at least reasonable speed-ups *without* adding to the maintenance burden, and can be done optionally as part of the build process. We do that for Cython itself during its installation, for example.

Stefan (Cython core developer)

[1] 3x faster URL routing by compiling a single Django module with Cython: https://us.pycon.org/2017/schedule/presentation/693/
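As a sketch of that optional build step (the package and module names are hypothetical; Cython.Build.cythonize is the public API this relies on):

# setup.py: compile a pure-Python module with Cython when it is available,
# and fall back to shipping the plain .py files otherwise.
from setuptools import setup

try:
    from Cython.Build import cythonize
    ext_modules = cythonize(["mypkg/hot_module.py"], language_level=3)
except ImportError:
    ext_modules = []  # the pure-Python install still works unchanged

setup(
    name="mypkg",
    packages=["mypkg"],
    ext_modules=ext_modules,
)

Because the compiled extension sits next to the original .py file, environments without Cython simply keep importing the Python version.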
On 20 July 2017 at 23:32, Stefan Behnel <stefan_ml@behnel.de> wrote:
So, before considering to write an accelerator module in C that replaces some existing Python module, and thus duplicating its entire source code with highly increased complexity, I'd like to remind you that simply compiling the Python module itself to C should give at least reasonable speed-ups *without* adding to the maintenance burden, and can be done optionally as part of the build process. We do that for Cython itself during its installation, for example.
And if folks are concerned about the potential bootstrapping issues with this approach, the gist is that it would have to look something like this:

Phase 0: freeze importlib
- build a CPython with only builtin and frozen module support
- use it to freeze importlib

Phase 1: traditional CPython
- build the traditional Python interpreter with no Cython accelerated modules

Phase 2: accelerated CPython
- if not otherwise available, use the traditional Python interpreter to download & install Cython in a virtual environment
- run Cython to selectively precompile key modules (such as those implicitly imported at startup)

Technically, phase 2 doesn't actually *change* CPython itself, since the import system is already set up such that if an extension module and a source module are side-by-side in the same directory, then the extension module will take precedence. As a result, precompiling with Cython is similar in many ways to precompiling to bytecode, it's just that the result is native machine code with Python C API calls, rather than CPython bytecode.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
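A quick way to see why phase 2 needs no change to the import system: the path-based finder already probes extension suffixes before source suffixes, which is what lets a compiled module shadow its .py sibling. A small check (output abbreviated and platform-dependent):

import importlib.machinery

# FileFinder tries these loader suffix groups in order: extensions first,
# then source, then bytecode - so foo.cpython-*.so wins over foo.py.
print(importlib.machinery.EXTENSION_SUFFIXES)  # e.g. ['.cpython-37m-x86_64-linux-gnu.so', '.abi3.so', '.so']
print(importlib.machinery.SOURCE_SUFFIXES)     # ['.py']
print(importlib.machinery.BYTECODE_SUFFIXES)   # ['.pyc']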
On Wed, 19 Jul 2017 14:59:52 +0200 Victor Stinner <victor.stinner@gmail.com> wrote:
Hi,
On Twitter, Raymond Hettinger wrote:
"The decision making process on Python-dev is an anti-pattern, governed by anecdotal data and ambiguity over what problem is solved."
https://twitter.com/raymondh/status/887069454693158912
About "anecdotal data", I would like to discuss the Python startup time.
And I would like to step back and examine the general criticism of "anecdotal data".

Large software and hardware companies have the resources to conduct comprehensive surveys of how people use their products. For example, Intel might have accumulated millions of traces of critical production x86 code that they want to keep running efficiently (or even keep running at all). Apple might have thousands of third-party applications which they can simulate running on a newer version of whatever OS, core library or pieces of hardware those applications rely on. Even Google may nowadays have hundreds or thousands of critical services written in Go, and they may be able to assess the effect of further changes of the Go runtime on those services (not sure they do, but they would certainly have the resources to).

CPython is a comparatively small, disorganized and volunteer-based community. It doesn't have the resources or organization required to lead such studies on a regular basis. Chances are it never will. So all we can rely on is 1) our respective individual experiences in the field, and 2) anecdotal data.

When we rewrote the Python 3 IO stack in C, we were relying on our intuition that high-performance IO is important, and on anecdotal data (micro-benchmarks) that the pure Python IO stack is slow. When Tim or Raymond tweak the lookup function for dicts, they rely on anecdotal data delivered by a few select micro-benchmarks, and their intuition that some use cases need to be fast (for example dicts with string keys or keys made up of consecutive integers). We don't have any hard data that all those optimizations are necessary for the majority of Python applications. I don't think anybody in the world has statistically sound data about the entire body of Python code, or even a sufficiently large and relevant subset thereof (such as "Python code used in production for critical services").

We aren't scientists. We are engineers and have to make do with whatever anecdotes we are aware of (be they from our own experiences, or users' complaints). We can't just say "yes, there seems to be a performance issue, but I'll wait until we have non-anecdotal data that it's important". Because that day will probably never come, and in the meantime our users will have fled elsewhere.

Regards

Antoine.
Exactly. This is how Python came to be in the first place. Benchmarks are great, but don't underestimate creativity.

On Jul 19, 2017 8:15 AM, "Antoine Pitrou" <solipsis@pitrou.net> wrote:
[snip]
On Wed, 19 Jul 2017 14:59:52 +0200 Victor Stinner <victor.stinner@gmail.com> wrote:
Hi,
On Twitter, Raymond Hettinger wrote:
"The decision making process on Python-dev is an anti-pattern, governed by anecdotal data and ambiguity over what problem is solved."
Kind-of OT: while I understand (and have sometimes felt myself) the desire to vent frustration about a decision one doesn't agree with, there should be *at least* a link to the discussion alluded to so that readers can make up their own minds. Otherwise, it feels to me like any disagreement here may end up chastised on Twitter by some influential figure of authority. That's not a pleasant place to be in. Regards Antoine.
On 07/19/2017 05:59 AM, Victor Stinner wrote:
Mercurial startup time is already 45.8x slower than Git whereas tested Mercurial runs on Python 2.7.12. Now try to sell Python 3 to Mercurial developers, with a startup time 2x - 3x slower...
When Matt Mackall spoke at the Python Language Summit some years back, I recall that he specifically complained about Python startup time. He said Python 3 "didn't solve any problems for [them]"--they'd already solved their Unicode hygiene problems--and that Python's slow startup time was already a big problem for them. Python 3 being /even slower/ to start was absolutely one of the reasons why they didn't want to upgrade. You might think "what's a few milliseconds matter". But if you run hundreds of commands in a shell script it adds up. git's speed is one of the few bright spots in its UX, and hg's comparative slowness here is a palpable disadvantage.
So please continue efforts for make Python startup even faster to beat all other programming languages, and finally convince Mercurial to upgrade ;-)
I believe Mercurial is, finally, slowly porting to Python 3. https://www.mercurial-scm.org/wiki/Python3 Nevertheless, I can't really be annoyed or upset at them moving slowly to adopt Python 3, as Matt's objections were entirely legitimate. Cheers, //arry/
Yes, agreed that startup time matters for scripting. I was talking to someone on the Google Cloud SDK (CLI) team recently, and they said startup time is a big deal for them ... it's especially problematic for shell tab completion helpers, because every time you press tab the shell has to load your Python program to do the completion. Even a couple dozen milliseconds is noticeable when you're typing quickly. -Ben On Wed, Jul 19, 2017 at 3:15 PM, Larry Hastings <larry@hastings.org> wrote:
[snip]
On Wed, 19 Jul 2017 15:26:47 -0400 Ben Hoyt <benhoyt@gmail.com> wrote:
Yes, agreed that startup time matters for scripting. I was talking to someone on the Google Cloud SDK (CLI) team recently, and they said startup time is a big deal for them ... it's especially problematic for shell tab completion helpers, because every time you press tab the shell has to load your Python program to do the completion.
And also, for the same reason, for shell prompt additions such as git-prompt. Mercurial had to write a C client (chg) to make this usable. Regards Antoine.
As long as we are talking anecdotes:

"If it could save a person's life, could you find a way to save ten seconds off the boot time? If there were five million people using the Mac, and it took ten seconds extra to turn it on every day, that added up to three hundred million or so hours per year people would save, which was the equivalent of at least one hundred lifetimes saved per year."

Steve Jobs. (http://stevejobsdailyquote.com/2014/03/26/boot-time/)

It really does depend on how/what users are using Python for. In general, Python has been moving more and more toward a "systems development language" from a "scripting language". Which may make us think "scripting" issues like startup time don't matter -- but, of course, they matter a lot to those use cases.

-CHB

On Wed, Jul 19, 2017 at 1:35 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
[snip]
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R          (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115      (206) 526-6317 main reception

Chris.Barker@noaa.gov
On Wed, Jul 19, 2017 at 04:11:24PM -0700, Chris Barker wrote:
As long as we are talking anecdotes:
If it could save a person’s life, could you find a way to save ten seconds off the boot time? If there were five million people using the Mac, and it took ten seconds extra to turn it on every day, that added up to three hundred million or so hours per year people would save, which was the equivalent of at least one hundred lifetimes saved per year.
Steve Jobs.
And about a fifth of the time they spent standing in lines waiting to buy the latest unnecessary iGadget... But seriously, that calculation is completely bogus. Not only is Steve Jobs's arithmetic *completely* wrong, but the whole premise is nonsense.

Do the maths yourself: ten seconds per day is 3650 seconds in a year, which is slightly over an hour (3600 seconds). Multiply by five million users, and that's about five million hours, not 300 million. So Jobs exaggerates the time saved by a factor of sixty. (Or maybe Jobs was warning that Macs crash sixty times a day...)

But the premise is wrong too. Those hypothetical people don't turn their Macs on in sequence, each person turning their computer on only after the previous person's Mac had finished booting. They effectively boot them up in parallel but offset, spread out over a 24 hour period, so about 3472 people booting up at the same time each minute of the day. Time savings for parallel processes don't add in the way Jobs adds them: if we treat this as 1440 parallel processes (one per minute of the day), we save 1440 hours a year.

But really, the only meaningful calculation is that each person saves 10 seconds per day. We can't even meaningfully say they save one hour a year: it doesn't come nicely packaged up for you all at once, so you can actually do something useful with it, nor can you save those ten seconds from one day to the next. You only get one shot at using them. What can you do with ten seconds per day? By the time you decide what to do with the extra time, it's already gone.

There are good reasons for speeding up boot time, but this sort of calculation is not one of them. I think it is in particularly bad taste to exaggerate the significance of it by putting it in terms of saving lives. You want to save real lives? How about fixing the conditions in the sweatshops that make Apple phones? And installing suicide nets around the building doesn't count.

-- Steve
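For the record, the corrected arithmetic written out:

# Jobs's premise, redone: 10 seconds/day/user, 5 million users.
seconds_per_user_per_year = 10 * 365  # 3650 s, just over an hour
total_hours = 5_000_000 * seconds_per_user_per_year / 3600
print(round(total_hours))  # ~5,069,444 person-hours/year, not 300 million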
On 19 July 2017 at 21:19, Steven D'Aprano <steve@pearwood.info> wrote:
But the premise is wrong too. Those hypothetical people don't turn their Macs on in sequence, each person turning their computer on only after the previous person's Mac had finished booting. They effectively boot them up in parallel but offset, spread out over a 24 hour period, so about 3472 people booting up at the same time each minute of the day. Time savings for parallel processes don't add in the way Jobs adds them, if we treat this as 1440 parallel processes (one per minute of the day) we save 1440 hours a year.
Ah, but the relevant unit here is person-hours, not hours: Jobs is claiming that *each* Mac user loses X% of *their* life to boot times, and then adds all those slices of life together into N lifetimes (which again, are counted in person-years, not years). It's still wrong, though: longer boot times actually increase the proportion of your life spent in meaningful activity (e.g. going to the canteen and talking to someone). -[]z.
On 7/19/2017 12:15 PM, Larry Hastings wrote:
[snip]
I just now found this thread when searching the archive for threads about startup time. And I was searching for threads about startup time because Mercurial's startup time has been getting slower over the past few months and this is causing substantial pain.

As I posted back in 2014 [1], CPython's startup overhead was >10% of the total CPU time in Mercurial's test suite. And when you factor in the time to import modules that get Mercurial to a point where it can run commands, it was more like 30%!

Mercurial's full test suite currently runs `hg` ~25,000 times. Using Victor's startup time numbers of 6.4ms for 2.7 and 14.5ms for 3.7/master, Python startup overhead contributes ~160s on 2.7 and ~360s on 3.7/master. Even if you divide this by the number of available CPU cores, we're talking dozens of seconds of wall time just waiting for CPython to get to a place where Mercurial's first bytecode can execute. And the problem is worse when you factor in the time it takes to import Mercurial's own modules.

As a concrete example, I recently landed a Mercurial patch [2] that stubs out zope.interface to prevent the import of 9 modules on every `hg` invocation. This "only" saved ~6.94ms for a typical `hg` invocation. But this decreased the CPU time required to run the test suite on my i7-6700K from ~4450s to ~3980s (~89.5% of original) - a reduction of almost 8 minutes of CPU time (and over 1 minute of wall time)!

By the time CPython gets Mercurial to a point where we can run useful code, we've already blown most of or past the time budget where humans perceive an action/command as instantaneous. If you ignore startup overhead, Mercurial's performance compares quite well to Git's for many operations. But the reality is that CPython startup overhead makes it look like Mercurial is non-instantaneous before Mercurial even has the opportunity to execute meaningful code!

Mercurial provides a `chg` program that essentially spins up a daemon `hg` process running a "command server" so the `chg` program [written in C - no startup overhead] can dispatch commands to an already-running Python/`hg` process and avoid paying the startup overhead cost. When you run Mercurial's test suite using `chg`, it completes *minutes* faster. `chg` exists mainly as a workaround for slow startup overhead.

Changing gears, my day job is maintaining Firefox's build system. We use Python heavily in the build system. And again, Python startup overhead is problematic. I don't have numbers offhand, but we invoke likely a few hundred Python processes as part of building Firefox. It should be several thousand. But, we've had to "hack" parts of the build system to "batch" certain build actions in single process invocations in order to avoid Python startup overhead. This undermines the ability of some build tools to formulate a reasonable understanding of the DAG, it causes a bit of pain for build system developers, and it makes it difficult to achieve "no-op" and fast incremental builds, because we're always invoking certain Python processes since we've had to move DAG awareness out of the build backend and into Python. At some point, we'll likely replace Python code with Rust so the build system is more "pure" and easier to maintain and reason about.

I've seen posts in this thread and elsewhere in the CPython development universe that challenge whether milliseconds in startup time matter. Speaking as a Mercurial and Firefox build system developer, *milliseconds absolutely matter*. Going further, *fractions of milliseconds matter*. For Mercurial's test suite with its ~25,000 Python process invocations, 1ms translates to ~25s of CPU time. With 2.7, Mercurial can dispatch commands in ~50ms. When you load common extensions, it isn't uncommon to see process startup overhead of 100-150ms! A millisecond here. A millisecond there. Before you know it, we're talking *minutes* of CPU (and potentially wall) time in order to run Mercurial's test suite (or build Firefox, or ...).

From my perspective, Python process startup and module import overhead is a severe problem for Python. I don't say this lightly, but in my mind the problem causes me to question the viability of Python for popular use cases, such as CLI applications. When choosing a programming language, I want one that will scale as a project grows. Vanilla process overhead has Python starting off significantly slower than compiled code (or even Perl) and adding module import overhead into the mix makes Python slower and slower as projects grow. As someone who has to deal with this slowness on a daily basis, I can tell you that it is extremely frustrating and it does matter. I hope that the importance of the problem will be acknowledged (milliseconds *do* matter) and that creative minds will band together to address it. Since I am disproportionately impacted by this issue, if there's anything I can do to help, let me know.

Gregory

[1] https://mail.python.org/pipermail/python-dev/2014-May/134528.html
[2] https://www.mercurial-scm.org/repo/hg/rev/856f381ad74b
On Wed, May 2, 2018, 4:53 AM Gregory Szorc <gregory.szorc@gmail.com> wrote:
[snip]
Is your Python interpreter statically linked? The Python 3 ones from the anaconda distribution (use Miniconda!) are for Linux and macOS and that roughly halved our startup times.
What do you propose to make Python startup faster?

As I wrote in my previous emails, many Python core developers care about the startup time and we are working on making it faster.

INADA Naoki added -X importtime to identify slow imports and understand where Python spends its startup time.

Recent example: Barry Warsaw identified that pkg_resources is slow and added importlib.resources to Python 3.7: https://docs.python.org/dev/library/importlib.html#module-importlib.resource...

Brett Cannon has also been working on a standard solution for lazy imports for many years: https://pypi.org/project/modutil/ https://snarky.ca/lazy-importing-in-python-3-7/

Nick Coghlan is working on the C API to configure Python startup: PEP 432. When it is ready, maybe Mercurial could use a custom Python optimized for its use case.

IMHO the Python import system is inefficient. We try too many alternative names. Example with Python 3.8:

$ ./python -vv:
import dontexist
# trying /home/vstinner/prog/python/master/dontexist.cpython-38dm-x86_64-linux-gnu.so
# trying /home/vstinner/prog/python/master/dontexist.abi3.so
# trying /home/vstinner/prog/python/master/dontexist.so
# trying /home/vstinner/prog/python/master/dontexist.py
# trying /home/vstinner/prog/python/master/dontexist.pyc
# trying /home/vstinner/prog/python/master/Lib/dontexist.cpython-38dm-x86_64-linux-gnu.so
# trying /home/vstinner/prog/python/master/Lib/dontexist.abi3.so
# trying /home/vstinner/prog/python/master/Lib/dontexist.so
# trying /home/vstinner/prog/python/master/Lib/dontexist.py
# trying /home/vstinner/prog/python/master/Lib/dontexist.pyc
# trying /home/vstinner/prog/python/master/build/lib.linux-x86_64-3.8-pydebug/dontexist.cpython-38dm-x86_64-linux-gnu.so
# trying /home/vstinner/prog/python/master/build/lib.linux-x86_64-3.8-pydebug/dontexist.abi3.so
# trying /home/vstinner/prog/python/master/build/lib.linux-x86_64-3.8-pydebug/dontexist.so
# trying /home/vstinner/prog/python/master/build/lib.linux-x86_64-3.8-pydebug/dontexist.py
# trying /home/vstinner/prog/python/master/build/lib.linux-x86_64-3.8-pydebug/dontexist.pyc
# trying /home/vstinner/.local/lib/python3.8/site-packages/dontexist.cpython-38dm-x86_64-linux-gnu.so
# trying /home/vstinner/.local/lib/python3.8/site-packages/dontexist.abi3.so
# trying /home/vstinner/.local/lib/python3.8/site-packages/dontexist.so
# trying /home/vstinner/.local/lib/python3.8/site-packages/dontexist.py
# trying /home/vstinner/.local/lib/python3.8/site-packages/dontexist.pyc
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'dontexist'
Why do we still check for the .pyc file outside __pycache__ directories? Why do we have to check for 3 different names for .so files? Does Mercurial need all directories of sys.path? What's the status of the "system python" project? :-)

I also would prefer Python without the site module. Can we rewrite this module in C maybe? Until recently, the site module was needed on Windows to create the "mbcs" encoding alias. Fortunately, that feature has been moved into Lib/encodings/__init__.py (new private _alias_mbcs() function).

Python 3.7b3+:

$ python3.7 -X importtime -c pass
import time: self [us] | cumulative | imported package
import time:        95 |         95 | zipimport
import time:       589 |        589 | _frozen_importlib_external
import time:        67 |         67 | _codecs
import time:       498 |        565 | codecs
import time:       425 |        425 | encodings.aliases
import time:       641 |       1629 | encodings
import time:       228 |        228 | encodings.utf_8
import time:       143 |        143 | _signal
import time:       335 |        335 | encodings.latin_1
import time:        58 |         58 | _abc
import time:       265 |        322 | abc
import time:       298 |        619 | io
import time:        69 |         69 | _stat
import time:       196 |        265 | stat
import time:       169 |        169 | genericpath
import time:       336 |        505 | posixpath
import time:      1190 |       1190 | _collections_abc
import time:       600 |       2557 | os
import time:       223 |        223 | _sitebuiltins
import time:       214 |        214 | sitecustomize
import time:        74 |         74 | usercustomize
import time:       477 |       3544 | site

Victor
On Wed, 2 May 2018 11:26:35 +0200 Victor Stinner <vstinner@redhat.com> wrote:
Brett Cannon has also been working on a standard solution for lazy imports for many years: https://pypi.org/project/modutil/ https://snarky.ca/lazy-importing-in-python-3-7/
AFAIK, Mercurial already has its own lazy importer.
Nick Coghlan is working on the C API to configure Python startup: PEP 432. When it is ready, maybe Mercurial could use a custom Python optimized for its use case.
IMHO the Python import system is inefficient. We try too many alternative names.
The overhead of importing is not in trying too many names, but in loading the module and executing its bytecode.
Why do we still check for the .pyc file outside __pycache__ directories?
Because we support sourceless distributions.
Why do we have to check for 3 different names for .so files?
See https://bugs.python.org/issue32387 Regards Antoine.
Antoine:
The overhead of importing is not in trying too many names, but in loading the module and executing its bytecode.
That was my conclusion as well when I did some profiling last fall at the Python core sprint. My lazy execution experiments are an attempt to solve this: https://github.com/python/cpython/pull/6194

I expect that Mercurial is already doing a lot of tricks to make execution more lazy. They have a lazy module import hook but they probably do other things to not execute more bytecode at startup than is needed. My lazy execution idea is that this could happen more automatically, i.e. don't pay for something you don't use. Right now, with eager module imports, you usually pay a price for every bit of bytecode that your program potentially uses.

Another idea, suggested to me by Carl Shapiro, is to store unmarshalled Python data in the heap section of the executable (or in DLLs). Then, the OS page fault handling would take care of only loading the data into RAM that is actually being used. The linker would take care of fixing up pointer references. There are a lot of details to work out with this idea, but I have heard that Jeethu Rao (Carl's colleague at Instagram) has a prototype implementation that shows promise.

Regards, Neil
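To make Antoine's "the cost is in loading and executing" point concrete, here is a rough one-shot timing sketch (the "string" module is an arbitrary example, and a single measurement like this is only indicative):

import importlib.util
import marshal
import time

spec = importlib.util.find_spec("string")
with open(spec.origin, "r", encoding="utf-8") as f:
    code = compile(f.read(), spec.origin, "exec")
data = marshal.dumps(code)  # roughly what a .pyc file stores

t0 = time.perf_counter()
code = marshal.loads(data)          # cost of loading the bytecode
t1 = time.perf_counter()
exec(code, {"__name__": "string"})  # cost of executing the module body
t2 = time.perf_counter()
print(f"unmarshal: {(t1 - t0) * 1e3:.3f} ms, exec: {(t2 - t1) * 1e3:.3f} ms")

The execution step typically dominates, which is what motivates deferring it (lazily or otherwise) rather than only trimming the path search.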
On Tue, May 1, 2018 at 11:55 PM, Ray Donnelly <mingw.android@gmail.com> wrote:
Is your Python interpreter statically linked? The Python 3 ones from the anaconda distribution (use Miniconda!) are for Linux and macOS and that roughly halved our startup times.
My Python interpreters use a shared library. I'll definitely investigate the performance of a statically-linked interpreter.

Correct me if I'm wrong, but aren't there downsides with regard to C extension compatibility when there is no shared libpython? Or does all the packaging tooling "just work" without a libpython? (It's possible I have my wires crossed up with something else regarding a statically linked Python.)

On Wed, May 2, 2018 at 2:26 AM, Victor Stinner <vstinner@redhat.com> wrote:
What do you propose to make Python startup faster?
That's a very good question. I'm not sure I'm able to answer it, because I haven't dug into CPython's internals much farther than what is required to implement C extensions. But I can share insight from what the Mercurial project has collectively learned.
As I wrote in my previous emails, many Python core developers care about the startup time and we are working on making it faster.
INADA Naoki added -X importtime to identify slow imports and understand where Python spends its startup time.
-X importtime is a great start! For a follow-up enhancement, it would be useful to see what aspects of import are slow. Is it finding modules (involves filesystem I/O)? Is it unmarshaling pyc files? Is it executing the module code? If executing code, what part is slow? Inline statements/expressions? Compiling types? Printing the microseconds it takes to import a module is useful. But it only gives me a general direction: I want to know what parts of the import made it slow so I know if I should be focusing on code running during module import, slimming down the size of a module, eliminating the module import from fast paths, pursuing alternative module importers, etc.
Recent example: Barry Warsaw identified that pkg_resources is slow and added importlib.resources to Python 3.7: https://docs.python.org/dev/library/importlib.html#module-importlib.resources
Brett Cannon has also been working on a standard solution for lazy imports for many years: https://pypi.org/project/modutil/ https://snarky.ca/lazy-importing-in-python-3-7/
Mercurial has used lazy module imports for years. On 2.7.14, it reduces `hg version` from ~160ms to ~55ms (~34% of original). On Python 3, we're using `importlib.util.LazyLoader` and it reduces `hg version` on 3.7 from ~245ms to ~120ms (~49% of original). I'm not sure why Python 3's built-in module importer doesn't yield the speedup that our custom Python 2 importer does. One explanation is that our custom importer is more advanced than importlib. Another is that Python 3's import mechanism is slower (possibly due to being written in Python instead of C). We haven't yet spent much time optimizing Mercurial for Python 3: our immediate goal is to get it working first. Given the startup performance problem on Python 3, it is only a matter of time before we dig into this further.

It's worth noting that lazy module importing can be undone via common patterns. Most commonly, `from foo import X`. It's *really* difficult to implement a proper object proxy. Mercurial's lazy importer gives up in this case and imports the module and exports the symbol. (But if the imported module is a package, we detect that and make the module exports proxies to a lazy module.)

Another common undermining of the lazy importer is code that runs during import-time module exec and accesses an attribute, e.g.

```
import foo

class myobject(foo.Foo):
    pass
```

Mercurial goes out of its way to avoid these patterns so modules can be delay-imported as much as possible. As long as import times are problematic, it would be helpful if the standard library adopted similar patterns. Although I recognize there are backwards compatibility concerns that tie your hands a bit.
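For readers following along, the stdlib recipe looks roughly like this (adapted from the importlib documentation; "json" is only an example target):

import importlib.util
import sys

def lazy_import(name):
    # Find the module but defer executing its body until the first
    # attribute access - the LazyLoader approach mentioned above.
    spec = importlib.util.find_spec(name)
    spec.loader = importlib.util.LazyLoader(spec.loader)
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    spec.loader.exec_module(module)
    return module

json = lazy_import("json")   # no module code has run yet
print(json.dumps({"a": 1}))  # body executes here, on first attribute access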
Nick Coghlan is working on the C API to configure Python startup: PEP 432. When it will be ready, maybe Mercurial could use a custom Python optimized for its use case.
That looks great! The direction Mercurial is going in is that `hg` will likely become a Rust binary (instead of a #!python script) that will use an embedded Python interpreter. So we will have low-level control over the interpreter via the C API. I'd also like to see us distribute a copy of Python in our official builds. This will allow us to take various shortcuts, such as not having to probe various sys.path entries since certain packages can only exist in one place.

I'd love to get to the state Google is at where they have self-contained binaries with ELF sections containing Python modules. But that requires a bit of very low-level hacking. We'll likely have a Rust binary (that possibly static links libpython) and a separate JAR/zip-like file containing resources.

But many people obtain Python via their system package manager and no matter how hard we scream that Mercurial is a standalone application, they will configure their packages to link against the system libpython and use the system Python's standard library. This will potentially undo many of our startup time wins.
IMHO the Python import system is inefficient. We try too many alternative names.
Example with Python 3.8
$ ./python -vv:
import dontexist
# trying /home/vstinner/prog/python/master/dontexist.cpython-38dm-x86_64-linux-gnu.so
# trying /home/vstinner/prog/python/master/dontexist.abi3.so
# trying /home/vstinner/prog/python/master/dontexist.so
# trying /home/vstinner/prog/python/master/dontexist.py
# trying /home/vstinner/prog/python/master/dontexist.pyc
# trying /home/vstinner/prog/python/master/Lib/dontexist.cpython-38dm-x86_64-linux-gnu.so
# trying /home/vstinner/prog/python/master/Lib/dontexist.abi3.so
# trying /home/vstinner/prog/python/master/Lib/dontexist.so
# trying /home/vstinner/prog/python/master/Lib/dontexist.py
# trying /home/vstinner/prog/python/master/Lib/dontexist.pyc
# trying /home/vstinner/prog/python/master/build/lib.linux-x86_64-3.8-pydebug/dontexist.cpython-38dm-x86_64-linux-gnu.so
# trying /home/vstinner/prog/python/master/build/lib.linux-x86_64-3.8-pydebug/dontexist.abi3.so
# trying /home/vstinner/prog/python/master/build/lib.linux-x86_64-3.8-pydebug/dontexist.so
# trying /home/vstinner/prog/python/master/build/lib.linux-x86_64-3.8-pydebug/dontexist.py
# trying /home/vstinner/prog/python/master/build/lib.linux-x86_64-3.8-pydebug/dontexist.pyc
# trying /home/vstinner/.local/lib/python3.8/site-packages/dontexist.cpython-38dm-x86_64-linux-gnu.so
# trying /home/vstinner/.local/lib/python3.8/site-packages/dontexist.abi3.so
# trying /home/vstinner/.local/lib/python3.8/site-packages/dontexist.so
# trying /home/vstinner/.local/lib/python3.8/site-packages/dontexist.py
# trying /home/vstinner/.local/lib/python3.8/site-packages/dontexist.pyc
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'dontexist'
Why do we still check for the .pyc file outside __pycache__ directories?
Why do we have to check for 3 different names for .so files?
Yes, I also cringe every time I trace Python's system calls and see these needless stats and file opens. Unless Python adds the ability to tell the import mechanism what type of module to import, Mercurial will likely modify our custom importer to only look for specific files.

We do provide pure Python modules for modules that have C implementations. But we have code that ensures that the C version is loaded for certain Python configurations, because we don't want users accidentally using the non-C modules and then complaining about Mercurial's performance! We already denote the set of modules backed by C. What we're missing (but is certainly possible to implement) is code that limits the module finding search depending on whether the module is backed by Python or C. But this only really works for Mercurial's modules: we don't really know what the standard library is doing, and coding assumptions about standard library behavior into Mercurial feels dangerous.

If we ship our own Python distribution, we'll likely have a jar-like file containing all modules. Determining which file to load will then be a lookup in an in-memory file index, not a series of expensive system calls looking for files.
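For illustration, here is a sketch of one way to express that restriction with importlib's public machinery (not Mercurial's actual importer):

```
import sys
from importlib.machinery import FileFinder, SourceFileLoader

# Probe only plain .py sources: no .so name variants, no legacy .pyc lookup.
py_only_hook = FileFinder.path_hook((SourceFileLoader, ['.py']))

# Installing this for all of sys.path would break C extension imports, so a
# real application would register it only for path entries it controls.
sys.path_hooks.insert(0, py_only_hook)
sys.path_importer_cache.clear()  # force re-evaluation of cached path entries
```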
Does Mercurial need all directories of sys.path?
No and yes. Mercurial by itself can get by with just the standard library and Mercurial's own packages. But extensions change everything. An extension could modify sys.path though. So limiting sys.path inside Mercurial is somewhat reasonable. Although it's definitely unexpected for a Python application to be removing entries from sys.path when the application starts.
What's the status of the "system python" project? :-)
I also would prefer Python without the site module. Can we maybe rewrite this module in C? Until recently, the site module was needed on Windows to create the "mbcs" encoding alias. Happily, that feature has been moved into Lib/encodings/__init__.py (new private _alias_mbcs() function).
I also lament the startup time effects of site.py. When `hg` is a Rust binary, we will almost certainly skip site.py and manually perform any required actions that it was performing.
Python 3.7b3+:
$ python3.7 -X importtime -c pass
import time: self [us] | cumulative | imported package
import time:        95 |         95 | zipimport
import time:       589 |        589 | _frozen_importlib_external
import time:        67 |         67 | _codecs
import time:       498 |        565 | codecs
import time:       425 |        425 | encodings.aliases
import time:       641 |       1629 | encodings
import time:       228 |        228 | encodings.utf_8
import time:       143 |        143 | _signal
import time:       335 |        335 | encodings.latin_1
import time:        58 |         58 | _abc
import time:       265 |        322 | abc
import time:       298 |        619 | io
import time:        69 |         69 | _stat
import time:       196 |        265 | stat
import time:       169 |        169 | genericpath
import time:       336 |        505 | posixpath
import time:      1190 |       1190 | _collections_abc
import time:       600 |       2557 | os
import time:       223 |        223 | _sitebuiltins
import time:       214 |        214 | sitecustomize
import time:        74 |         74 | usercustomize
import time:       477 |       3544 | site
As for things Python could do to make things better, one idea is "package bundles." Instead of using .py, .pyc, .so, etc. files as separate files on the filesystem, allow Python packages to be distributed as standalone "archive" files, like Java's jar files. This has the advantage that there is only a single place to look for files in a given Python package. And since the bundle is immutable, you can index it so imports don't need to touch the filesystem to discover what is present: you do a quick memory lookup and jump straight to the available file. If you go this route, please don't require the use of zlib for file compression, as zlib is painfully slow compared to alternatives like lz4 and zstandard.

I know this kinda/sorta exists with zipimporter. But zipimporter uses zlib (slow) and only allows .py/.pyc files. And I think some Python application distribution tools have also solved this problem. I'd *really* like to see a proper/robust solution in Python itself.

Along that vein, it would be really nice if the "standalone Python application" story were a bit more formalized. From my perspective, it is insanely difficult to package and distribute an application that happens to use Python. It requires vastly different solutions for different platforms. I want to declare a minimal boilerplate somewhere (perhaps in setup.py) and run a command that produces an as-self-contained-as-possible application complete with platform-native installers. Presumably such a self-contained application could take many shortcuts with regards to process startup and mitigate this general problem.

Again, Mercurial is trending in the direction of making `hg` a Rust binary and distributing its own Python. Since we have to solve this packaging+distribution problem on multiple platforms, I'll try to keep an eye towards making whatever solution we concoct reusable by other projects.
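As a toy illustration of why an indexed bundle helps: module lookup becomes a single dictionary access instead of a series of stat()/open() probes. (BUNDLE and BundleFinder are made-up names; a real bundle would map names to offsets in an archive file rather than holding source strings in memory.)

```
import sys
import importlib.abc
import importlib.util

BUNDLE = {
    "greeting": "def hello():\n    return 'hello from the bundle'\n",
}

class BundleFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    def find_spec(self, name, path=None, target=None):
        if name in BUNDLE:  # one in-memory lookup, zero filesystem calls
            return importlib.util.spec_from_loader(name, self)
        return None

    def exec_module(self, module):
        code = compile(BUNDLE[module.__name__],
                       "<bundle:%s>" % module.__name__, "exec")
        exec(code, module.__dict__)

sys.meta_path.insert(0, BundleFinder())

import greeting
print(greeting.hello())  # hello from the bundle
```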
On Wed, May 2, 2018, 09:51 Gregory Szorc <gregory.szorc@gmail.com> wrote:
Correct me if I'm wrong, but aren't there downsides with regards to C extension compatibility to not having a shared libpython? Or does all the packaging tooling "just work" without a libpython? (It's possible I have my wires crossed up with something else regarding a statically linked Python.)
IIRC, the rule on Linux is that if you build an extension on a statically built python, then it can be imported on a shared python, but not vice-versa. Manylinux wheels are therefore always built on a static python so that they'll work everywhere. (We should probably clean this up upstream at some point, but there's not a lot of appetite for touching this stuff – very obscure, very easy to break things without realizing it, not much upside.) On Windows I don't think there is such a thing as a static build, because extensions have to link to the python dll to work at all. And on MacOS I'm not sure, though from knowing how their linker works my guess is that all extensions act like static extensions do on Linux. -n
On Wed, May 2, 2018 at 6:55 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Windows I don't think there is such a thing as a static build, because extensions have to link to the python dll to work at all. And on MacOS I'm not sure, though from knowing how their linker works my guess is that all extensions act like static extensions do on Linux.
Yes, on Windows there's always a python?.dll.

macOS is an interesting one. For Anaconda 5.0 I read somewhere (how's that for a useless reference - and perhaps I got the wrong end of the stick) that Python for all Unixen should use a statically linked interpreter, so I happily went ahead and did that. Of course I tested it against a good few wheels at the time and everything seemed fine (well, no worse than the usual binary compatibility woes at least), so I went ahead with it.

Now that Python 3.7 is around the corner we have a chance to re-evaluate this decision. We have received no binary compatibility bugs whatsoever due to this change (we did get a few bugs where people used python-config incorrectly, either directly or via swig or CMake). Were we just lucky?

Anyway, it is obviously safer for us to do what upstream does, and I will try to post some benchmarks of static vs shared to the list so we can discuss it. I guess it is a little late in the release schedule to propose any such change for 3.7? If not, I will try to prepare something. I will discuss it in depth with the rest of the AD team soon too.
On Fri, 04 May 2018 00:21:54 +0000 Ray Donnelly <mingw.android@gmail.com> wrote:
Yes, on Windows there's always a python?.dll.
macOS is an interesting one. For Anaconda 5.0 I read somewhere (how's that for a useless reference - and perhaps I got the wrong end of the stick) that Python for all Unixen should use a statically linked interpreter so I happily went ahead and did that.
A statically linked Python can also be significantly faster (10 to 20% IIRC, more perhaps on ARM). I think you already know about that :-)
Anyway, it is obviously safer for us to do what upstream does and I will try to post some benchmarks of static vs shared to the list so we can discuss it.
I have no idea what our default builds do on macOS, I'll let Ned Deily or another mac expert answer (changing the topic in the hope he notices this subthread :-)). Regards Antoine.
On Fri, May 4, 2018 at 11:00 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Fri, 04 May 2018 00:21:54 +0000 Ray Donnelly <mingw.android@gmail.com> wrote:
Yes, on Windows there's always a python?.dll.
macOS is an interesting one. For Anaconda 5.0 I read somewhere (how's that for a useless reference - and perhaps I got the wrong end of the stick) that Python for all Unixen should use a statically linked interpreter so I happily went ahead and did that.
A statically linked Python can also be significantly faster (10 to 20% IIRC, more perhaps on ARM). I think you already know about that :-)
Indeed, and it worked out well on Intel too. Thanks for the recommendation.
Anyway, it is obviously safer for us to do what upstream does and I will try to post some benchmarks of static vs shared to the list so we can discuss it.
I have no idea what our default builds do on macOS, I'll let Ned Deily or another mac expert answer (changing the topic in the hope he notices this subthread :-)).
And thanks for doing this. For the benchmarks, I think I should run pyperformance against Python 3.6.5 (or would 3.7.0b4 be better?) built each way using the AD scripts, and reply here with the results. If I do not get it done today then I hope to get them ready by Monday.
On May 4, 2018, at 08:10, Ray Donnelly <mingw.android@gmail.com> wrote:
On Fri, May 4, 2018 at 11:00 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
I have no idea what our default builds do on macOS, I'll let Ned Deily or another mac expert answer (changing the topic in the hope he notices this subthread :-)).
And thanks for doing this. For the benchmarks, I think I should run pyperformance against Python 3.6.5 (or would 3.7.0b4 be better?) built each way using the AD scripts, and reply here with the results.
The macOS python interpreters provided by python.org binary installers have always (for a very long time of always) been built as shared, in particular the special macOS framework build configuration. It would be very interesting to do Apple-to-Apple comparisons of shared vs static builds on macOS. I would look forward to seeing any results you have, Ray, and your methodology. Static builds are on my list of things to look at for 3.8.
-- Ned Deily nad@python.org
On Fri, 04 May 2018 00:21:54 +0000 Ray Donnelly <mingw.android@gmail.com> wrote:
Now that Python 3.7 is around the corner we have a chance to re-evaluate this decision. We have received no binary compatibility bugs whatsoever due to this change (we did get a few bugs where people used python-config incorrectly, either directly or via swig or CMake). Were we just lucky?
As a sidenote, it seems there may be issues when static linking against Python to embed it: https://bugs.python.org/issue33438 Regards Antoine.
On May 2, 2018, at 09:42, Gregory Szorc <gregory.szorc@gmail.com> wrote:
As for things Python could do to make things better, one idea is for "package bundles." Instead of using .py, .pyc, .so, etc files as separate files on the filesystem, allow Python packages to be distributed as standalone "archive" files.
Of course, .so files have to be extracted to the file system, because we have to live with dlopen()’s API. In our first release of shiv, we had a loader that did exactly that for just .so files. We ended up just doing .pyz file unpacking unconditionally, ignoring zip-safe, mostly because too many packages still use __file__, which doesn’t work in a zipapp. I’ll plug shiv and importlib.resources (and the standalone importlib_resources) again here. :)
If you go this route, please don't require the use of zlib for file compression, as zlib is painfully slow compared to alternatives like lz4 and zstandard.
shiv works in a similar manner to pex, although it's a completely new implementation that doesn't suffer from huge sys.paths or the use of pkg_resources. shiv + importlib.resources saves us 25-50% of warm cache startup time. That makes things better, but still not ideal. Ultimately, though, it means we don't suffer from the slowness of zlib, since the zlib cost is only paid during the initial .pyz unpacking (i.e. the cold cache case, which we don't count). Cheers, -Barry
On 5/2/18 2:24 PM, Barry Warsaw wrote:
On May 2, 2018, at 09:42, Gregory Szorc <gregory.szorc@gmail.com> wrote:
As for things Python could do to make things better, one idea is for "package bundles." Instead of using .py, .pyc, .so, etc files as separate files on the filesystem, allow Python packages to be distributed as standalone "archive" files.
Of course, .so files have to be extracted to the file system, because we have to live with dlopen()’s API. In our first release of shiv, we had a loader that did exactly that for just .so files. We ended up just doing .pyz file unpacking unconditionally, ignoring zip-safe, mostly because too many packages still use __file__, which doesn’t work in a zipapp.
FWIW, Google has a patched glibc that implements dlopen_with_offset(). It allows you to do things like memory map the current binary and then dlopen() a shared library embedded in an ELF section. I've seen the code in the branch at https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/google/grte/.... It likely exists elsewhere. An attempt to upstream it occurred at https://sourceware.org/bugzilla/show_bug.cgi?id=11767. It is probably well worth someone's time to pick up the torch and get this landed in glibc so everyone can be a massive step closer to self-contained, single binary applications. Of course, it will take years before you can rely on a glibc version with this API being deployed universally. But the sooner this lands...
On May 2, 2018, at 15:24, Gregory Szorc <gregory.szorc@gmail.com> wrote:
FWIW, Google has a patched glibc that implements dlopen_with_offset(). It allows you to do things like memory map the current binary and then dlopen() a shared library embedded in an ELF section.
I've seen the code in the branch at https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/google/grte/.... It likely exists elsewhere. An attempt to upstream it occurred at https://sourceware.org/bugzilla/show_bug.cgi?id=11767. It is probably well worth someone's time to pick up the torch and get this landed in glibc so everyone can be a massive step closer to self-contained, single binary applications. Of course, it will take years before you can rely on a glibc version with this API being deployed universally. But the sooner this lands...
Oh, I’m well aware of the history of this patch. :) I’d love to see it available on the platforms I use, and agree it’s well worth someone’s time to continue to shepherd this through the processes to make that happen. Even if it did take years to roll out, Python could use it with the proper compile-time checks. -Barry
On Wed, May 2, 2018, at 09:42, Gregory Szorc wrote:
The direction Mercurial is going in is that `hg` will likely become a Rust binary (instead of a #!python script) that will use an embedded Python interpreter. So we will have low-level control over the interpreter via the C API. I'd also like to see us distribute a copy of Python in our official builds. This will allow us to take various shortcuts, such as not having to probe various sys.path entries since certain packages can only exist in one place. I'd love to get to the state Google is at where they have self-contained binaries with ELF sections containing Python modules. But that requires a bit of very low-level hacking. We'll likely have a Rust binary (that possibly static links libpython) and a separate JAR/zip-like file containing resources.
I'm curious about the rust binary. I can see that would give you startup time benefits similar to the ones you could get hacking the interpreter directly; e.g., you can use a zipfile for everything and not have site.py. But it seems like the Python-side wins would stop there. Is this all a prelude to incrementally rewriting hg in rust? (Mercuric oxide?)
On Wed, May 2, 2018 at 8:26 PM, Benjamin Peterson <benjamin@python.org> wrote:
I'm curious about the rust binary. I can see that would give you startup time benefits similar to the ones you could get hacking the interpreter directly; e.g., you can use a zipfile for everything and not have site.py. But it seems like the Python-side wins would stop there. Is this all a prelude to incrementally rewriting hg in rust? (Mercuric oxide?)
The plans are recorded at https://www.mercurial-scm.org/wiki/OxidationPlan. tl;dr we want to write some low-level bits in Rust but we anticipate the bulk of the application logic remaining in Python. Nobody in the project is seriously talking about a complete rewrite in Rust. Contributors to the project have varying opinions on how aggressively Rust should be utilized. People who contribute to the C code, low-level primitives (like storage, deltas, etc), and those who care about performance tend to want more Rust. One thing we almost universally agree on is that we want to rewrite all of Mercurial's C code in Rust. I anticipate that figuring out the balance between Rust and Python in Mercurial will be an ongoing conversation/process for the next few years.
On 5/2/2018 8:56 PM, Gregory Szorc wrote:
Nobody in the project is seriously talking about a complete rewrite in Rust. Contributors to the project have varying opinions on how aggressively Rust should be utilized. People who contribute to the C code, low-level primitives (like storage, deltas, etc), and those who care about performance tend to want more Rust. One thing we almost universally agree on is that we want to rewrite all of Mercurial's C code in Rust. I anticipate that figuring out the balance between Rust and Python in Mercurial will be an ongoing conversation/process for the next few years.
Have you considered simply rewriting CPython in Rust?
And yes, the 4th word in that question was intended to produce peals of shocked laughter. But why Rust? Why not Go? http://esr.ibiblio.org/?p=7724
I'm hardly an expert, but AFAIK CPython's start-up issues are due more to a mix of architectural issues and the fact that it's hard to optimize imports while maintaining backwards compatibility with Python's dynamism. -- Ryan (ライアン) Yoko Shimomura, ryo (supercell/EGOIST), Hiroyuki Sawano >> everyone else https://refi64.com/
On 3 May 2018 at 15:56, Glenn Linderman <v+python@g.nevcal.com> wrote:
On 5/2/2018 8:56 PM, Gregory Szorc wrote:
Nobody in the project is seriously talking about a complete rewrite in Rust. Contributors to the project have varying opinions on how aggressively Rust should be utilized. People who contribute to the C code, low-level primitives (like storage, deltas, etc), and those who care about performance tend to want more Rust. One thing we almost universally agree on is that we want to rewrite all of Mercurial's C code in Rust. I anticipate that figuring out the balance between Rust and Python in Mercurial will be an ongoing conversation/process for the next few years.
Have you considered simply rewriting CPython in Rust?
FWIW, I'd actually like to see Rust approved as a language for writing stdlib extension modules, but actually ever making that change in policy would require a concrete motivating use case.
And yes, the 4th word in that question was intended to produce peals of shocked laughter. But why Rust? Why not Go?
Trying to get two different garbage collection engines to play nice with each other is a recipe for significant pain, since you can easily end up with uncollectable cycles that neither GC system has complete visibility into (all it needs is a loop from PyObject A -> Go Object B -> back to PyObject A). Combining Python and Rust can still get into that kind of trouble when using reference counting on the Rust side, but it's a lot easier to avoid than it is in runtimes with mandatory GC. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Thu, 3 May 2018 at 07:31 Nick Coghlan <ncoghlan@gmail.com> wrote:
FWIW, I'd actually like to see Rust approved as a language for writing stdlib extension modules, but actually ever making that change in policy would require a concrete motivating use case.
Eric Snow, Barry Warsaw, and I have actually discussed this as part of our weekly open source office hours at work, where we tend to talk about massive ideas that would take multiple people full-time to accomplish. :)
And yes, the 4th word in that question was intended to produce peals of shocked laughter. But why Rust? Why not Go?
Trying to get two different garbage collection engines to play nice with each other is a recipe for significant pain, since you can easily end up with uncollectable cycles that neither GC system has complete visibility into (all it needs is a loop from PyObject A -> Go Object B -> back to PyObject A).
Combining Python and Rust can still get into that kind of trouble when using reference counting on the Rust side, but it's a lot easier to avoid than it is in runtimes with mandatory GC.
Rust supports RAII <https://doc.rust-lang.org/rust-by-example/scope/raii.html> so it shouldn't be that bad.
On 5/2/2018 12:42 PM, Gregory Szorc wrote:
I know this kinda/sorta exists with zipimporter. But zipimporter uses zlib (slow) and only allows .py/.pyc files. And I think some Python application distribution tools have also solved this problem. I'd *really* like to see a proper/robust solution in Python itself. Along that vein, it would be really nice if the "standalone Python application" story were a bit more formalized. From my perspective, it is insanely difficult to package and distribute an application that happens to use Python. It requires vastly different solutions for different platforms. I want to declare a minimal boilerplate somewhere (perhaps in setup.py) and run a command that produces an as-self-contained-as-possible application complete with platform-native installers.
A few years ago I helped my wife create a tutorial in the Renpy visual storytelling engine. It is free and open source. https://www.renpy.org It is written in Python, while users write scripts in both Python and a custom scripting language. When we were done, we pressed a button and it generated self-contained zip files for Windows, Linux, and Mac. This can be done from any of the three platforms. After we tested all three files, she created a web page with links to the three files for download. There have been no complaints so far. Perhaps the file generators could be adapted to packaging a project directory into a self-contained app. -- Terry Jan Reedy
Thanks for bringing this topic up again. At $day_job, this is a highly visible and important topic, since the majority of our command line tools are written in Python (of varying versions from 2.7 to 3.6). Some of those tools can take upwards of 5 seconds or more just to respond to --help, which causes lots of pain for developers, who complain (rightly so) up the management chain. ;)

We've done a fair bit of work to bring those numbers down without super radical workarounds. Often there are problems not strictly related to the Python interpreter that contribute to this. Python gets blamed, but it's not always the interpreter's fault. Common issues include:

* Modules that have import-time side effects, such as network access or expensive creation of data structures. Python 3.7's `-X importtime` switch is a really wonderful way to identify the worst offenders. Once 3.7 is released, I do plan to spend some time using this to collect data internally so we can attack our own libraries, and perhaps put automated performance testing into our build stack, to identify start up time regressions.

* pkg_resources. When you have tons of entries on sys.path, pkg_resources does a lot of work at import time, and because of common patterns which tend to use pkg_resources namespace package support in __init__.py files, this just kills start up times. Of course, pkg_resources has other uses too, so even in a purely Python 3 world (where your namespace packages can omit the __init__.py), you'll often get clobbered as soon as you want to use the Basic Resource Access API. This is also pretty common, and it's the main reason why Brett and I created importlib.resources for 3.7 (with a standalone API-compatible library for older Pythons). That's one less reason to use pkg_resources, but it doesn't address the __init__.py use. Brett and I have been talking about addressing that for 3.8.

* pex - which we use as our single file zipapp tool. Especially the interaction between pex and pkg_resources introduces pretty significant overhead. My colleague Loren Carvalho created a tool called shiv which requires at least Python 3.6, avoids the use of pkg_resources, and implements other tricks to be much more performant than pex. Shiv is now open source and you can find it on RTD and GitHub.

The switch to shiv and importlib.resources can shave 25-50% off of warm cache start up times for zipapp style executables.

Another thing we've done, although I'm much less sanguine about it as a general approach, is to move imports into functions, but we're trying to only use that trick in the most critical cases (see the sketch below).

Some import time effects can't be changed. Decorators come to mind, and click is a popular library for CLIs that provides some great features, but decorators do prevent a lazy loading approach.
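A minimal sketch of the function-level import pattern (render_report and its json dependency are placeholders, not from our codebase):

```
def render_report(data):
    # Deferred import: nothing is paid at program startup; the cost
    # moves to the first call of this function.
    import json
    return json.dumps(data, indent=2)
```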
On May 1, 2018, at 20:26, Gregory Szorc <gregory.szorc@gmail.com> wrote:
You might think "what's a few milliseconds matter". But if you run hundreds of commands in a shell script it adds up. git's speed is one of the few bright spots in its UX, and hg's comparative slowness here is a palpable disadvantage.
Oh, for command line tools, milliseconds absolutely matter.
As a concrete example, I recently landed a Mercurial patch [2] that stubs out zope.interface to prevent the import of 9 modules on every `hg` invocation.
I have a similar dastardly plan to provide a pkg_resources stub :).
Mercurial provides a `chg` program that essentially spins up a daemon `hg` process running a "command server" so the `chg` program [written in C - no startup overhead] can dispatch commands to an already-running Python/`hg` process and avoid paying the startup overhead cost. When you run Mercurial's test suite using `chg`, it completes *minutes* faster. `chg` exists mainly as a workaround for slow startup overhead.
A couple of our developers demoed a similar approach for one of our CLIs that almost everyone uses. It's a big application with lots of dependencies, so particularly vulnerable to pex and pkg_resources overhead. While it was just a prototype, it was darn impressive to see subsequent invocations produce output almost immediately. It's unfortunate that we have to utilize all these tricks to get even moderately performant Python CLIs.

A few of us spent some time at last year's core Python dev sprint talking about other things we could do to improve Python's start up time, not just within the interpreter itself, but within the larger context of the Python ecosystem. Many ideas seem promising until you dive into the details, so it's definitely hard to imagine maintaining all of Python's dynamic semantics and still making it an order of magnitude faster to start up. But that's not an excuse to give up, and I'm hoping we can continue to attack the problem, both in the micro and the macro, for 3.8 and beyond, because the alternative is that Python becomes less popular as an implementation language for CLIs. That would be sad, and definitely has a long term impact on Python's popularity.

Cheers, -Barry
On Wed, May 2, 2018 at 2:13 PM, Barry Warsaw <barry@python.org> wrote:
A couple of our developers demoed a similar approach for one of our CLIs that almost everyone uses. It’s a big application with lots of dependencies, so particularly vulnerable to pex and pkg_resources overhead. While it was just a prototype, it was darn impressive to see subsequent invocations produce output almost immediately. It’s unfortunate that we have to utilize all these tricks to get even moderately performant Python CLIs.
Note that this kind of "trick" is not unique to Python. I see it used by large Java tools at work. In effect, emacs has done similar things for many decades with its saved core-dump at build time: it saves a snapshot of the initialized elisp interpreter state and loads that into memory instead of rerunning initialization to reproduce the state.

I don't know if anyone has looked at making a similar concept of saved post-startup interpreter state for rapid loading as a memory image possible in Python. I don't believe we're even at the point where all state can actually accurately be captured from CPython (extension modules can do anything). When you do that kind of trick, things like hash randomization tend to complicate matters and might need to be disabled. That feature may not matter for all CLI tools.

-gps
Recently, I reported how the stdlib slows down `import requests`: https://github.com/requests/requests/issues/4315#issuecomment-385584974

For Python 3.8, my ideas for faster startup time are:

* Add a lazy compiling API or flag in the `re` module, so the pattern is compiled when first used (see the sketch below).

* Add an IntEnum and IntFlag alternative in C, like PyStructSequence for namedtuple. It will make importing the `socket` and `ssl` modules much faster. (Both modules have huge enums/flags.)

* Add special casing for UTF-8 and ASCII in TextIOWrapper. When an application uses only UTF-8 or ASCII, we can skip importing the codecs and encodings packages entirely.

* Add a faster and simpler http.parser (maybe based on h11 [1]) and avoid using the email module in the http module.

[1]: https://h11.readthedocs.io/en/latest/

I don't have a solid estimate of how much these can speed up `import requests`, but I believe most of these ideas are worth pursuing.

Regards,
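To make the first idea concrete, here is a hedged sketch of a user-level lazy pattern wrapper (LazyPattern is a made-up name, not a proposed stdlib API):

```
import re

class LazyPattern:
    """Defer re.compile() until the pattern is first used."""

    def __init__(self, pattern, flags=0):
        self._args = (pattern, flags)
        self._compiled = None

    def __getattr__(self, name):
        # Called only for attributes not found on the instance,
        # i.e. the delegated pattern methods like search/match.
        if self._compiled is None:
            self._compiled = re.compile(*self._args)
        return getattr(self._compiled, name)

NUMBER = LazyPattern(r"\d+")   # no compilation cost at import time
NUMBER.search("abc 123")       # compiled here, on first use
```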
On Wed, May 2, 2018, 20:59 INADA Naoki <songofacandy@gmail.com> wrote:
Recently, I reported how stdlib slows down `import requests`. https://github.com/requests/requests/issues/4315#issuecomment-385584974
[...]
* Add faster and simpler http.parser (maybe, based on h11 [1]) and avoid using email module in http module.
It's always risky making predictions, but hopefully by the time 3.8 is out, requests will have switched to using h11 directly instead of the http module. (Kenneth wants the big headline feature for the next major requests release to be async support, and that pretty much requires switching to something like h11.) I don't know how fast importing h11 is though... It does currently compile a bunch of regexps at import time :-). -n
On May 2, 2018, at 8:57 PM, INADA Naoki <songofacandy@gmail.com> wrote:
Recently, I reported how stdlib slows down `import requests`. https://github.com/requests/requests/issues/4315#issuecomment-385584974
For Python 3.8, my ideas for faster startup time are:
* Add lazy compiling API or flag in `re` module. The pattern is compiled when first used.
How about going the other way and allowing compilation at Python *compile* time? That would actually make things faster instead of just moving the time spent around. I do see value in being less eager in Python, but I think the real wins are hiding behind ahead-of-time compilation. - Ł
On Thu, May 3, 2018 at 5:22 PM, Lukasz Langa <lukasz@langa.pl> wrote:
Agreed in concept. We've got a lot of unused letters that could be new string prefixes... (ugh)

I'd also like to see this concept somehow extended to decorators so that the results of the decoration can be captured in the compiled pyc rather than requiring execution at import time. I realize that limits what decorators can do, but the evil things they could do that this would eliminate are things they just shouldn't be doing in most situations anyway. Meaning: there would probably be two types of decorators... colons seem to be all the rage these days so we could add an @: operator for that. :P ... along with a from __future__ import to change the behavior of all decorators in a file from runtime to compile time by default.

from __future__ import compile_time_decorators  # we'd be unlikely to ever change the default and break things, __future__ seems wrong

@this_happens_at_compile_time(3)
def ...

@:this_waits_until_runtime(5)
def ...

Just a not-so-wild idea, no idea if this should become a PEP for 3.8. (The : syntax is a joke - I'd prefer @@ so it looks like eyeballs.)

If this were done to decorators, you can imagine extending that concept to something similar to allow compile time re.compile calls as some form of assignment decorator:

GREGS_RE = @re.compile(r'A regex compiled at compile time\. number = \d+')

-gps
On Fri, May 4, 2018 at 10:43 AM, Gregory P. Smith <greg@krypto.org> wrote:
At this point, we're squarely in python-ideas territory, but there are some possibilities. Imagine popping this line of code at the bottom of your file:

import importlib; importlib.freeze_module()

as a declaration that the dictionary for this module is now locked in and can be dumped out in whatever form is most efficient. Effectively, you're stating that you do not need any sort of dynamism (that call could be easily disabled for testing), and that, if the optimization breaks anything, you accept responsibility for it. How this would be implemented, I'm not sure, but that's no different from the @: idea.

ChrisA
FYI, a lot of these ideas were discussed back in September and October of 2017 on this list if you search the subject lines for "startup", e.g. starting here and here:

https://mail.python.org/pipermail/python-dev/2017-September/149150.html
https://mail.python.org/pipermail/python-dev/2017-October/149670.html

At the end Guido kicked (at least part of) the discussion back to python-ideas.

--Chris
On 2018-05-03, Lukasz Langa wrote:
On May 2, 2018, at 8:57 PM, INADA Naoki <songofacandy@gmail.com> wrote: * Add lazy compiling API or flag in `re` module. The pattern is compiled when first used.
How about go the other way and allow compiling at Python *compile*-time? That would actually make things faster instead of just moving the time spent around.
Lisp has a special form 'eval-when'. It can be used to cause evaluation of the body expression at compile time.

In Carl's "A fast startup patch" post, he talks about getting rid of the unmarshal step and storing objects in the heap segment of the executable. Those would be the objects necessary to evaluate code. The marshal module has a limited number of types that it handles. I believe they are: bool, bytes, code objects, complex, Ellipsis, float, frozenset, int, None, tuple and str.

If the same mechanism could handle more types, rather than storing the code to be evaluated, we could store the objects created after evaluation of the top-level module body. Or, have a mechanism to mark which code should be evaluated at compile time (much like the eval-when form). For the re.compile example, the compiled regex could be what is stored after compiling the Python module (i.e. the re.compile gets run at compile time). The objects created by re.compile (e.g. SRE_Pattern) would have to be something that the heap dumper could handle.

Traditionally, Python has had the model "there is only runtime". So, starting to do things at compile time complicates that model.

Regards, Neil
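For concreteness, the types listed above all round-trip through marshal today, the same mechanism .pyc files use (a quick demonstration, not part of any proposal):

```
import marshal

payload = (True, b"bytes", 3 + 4j, ..., 1.5, frozenset({1, 2}), 42, None, "str")
assert marshal.loads(marshal.dumps(payload)) == payload

# Code objects marshal too: this is exactly what a .pyc file contains.
code = compile("x = 1", "<example>", "exec")
blob = marshal.dumps(code)
exec(marshal.loads(blob))
```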
Inspired by chg: could one make a little startup utility that, when invoked the first time, starts up a raw Python interpreter, keeps it running somewhere, and then forks it to run the actual Python code? Then every invocation after that would make a new fork. I presume forking is a LOT faster than re-invoking the entire startup.

I suspect that many of the cases where startup time really matters are when a command line utility is likely to be invoked many times, often in the same shell instance. So having a "pre-built" warm interpreter ready to go could really help.

This is way past my technical expertise to know if it's possible, or to try to prototype it, but I'm sure many of you would know.

-CHB

Sent from my iPhone
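A very rough, POSIX-only sketch of the idea (real tools like chg handle ttys, signals, exit codes, and environment propagation, all of which this toy ignores; the socket path is made up):

```
import os
import socket

SOCK_PATH = "/tmp/py-warm.sock"  # illustrative path

if os.path.exists(SOCK_PATH):
    os.unlink(SOCK_PATH)

server = socket.socket(socket.AF_UNIX)
server.bind(SOCK_PATH)
server.listen()

while True:
    conn, _ = server.accept()
    if os.fork() == 0:
        # Child: inherits the already-initialized (warm) interpreter state.
        server.close()
        conn.sendall(b"ready\n")  # a real server would run the command here
        conn.close()
        os._exit(0)
    # Parent: close its copy of the connection and reap finished children.
    conn.close()
    os.waitpid(-1, os.WNOHANG)
```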
On May 7, 2018, at 12:28 PM, Neil Schemenauer <nas-python@arctrix.com> wrote:
<plug> https://refi64.com/uprocd/ </plug>
On Fri, May 11, 2018 at 11:05 AM, Ryan Gonzalez <rymg19@gmail.com> wrote:
<plug> https://refi64.com/uprocd/ </plug>
very cool -- but *nix only, of course :-(

But it seems that there is a demand for this sort of thing, and a few major projects are rolling their own. So maybe it makes sense to put something into the standard library that everyone could contribute to and use.

With regard to forking -- is there another way? I don't have the expertise to have any idea if this is possible, but:

* start up python
* capture the entire runtime image as a single binary blob
* could that blob be simply loaded into memory and run?

(hmm -- probably not -- memory addresses would be hard-coded then, yes?) or is memory virtualized enough these days?

-CHB

-- Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R
(206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
On Tue, May 15, 2018 at 1:29 AM Chris Barker via Python-Dev < python-dev@python.org> wrote:
It will break hash randomization.

See also: https://www.cvedetails.com/cve/CVE-2017-11499/

Regards,
-- Inada Naoki
On Mon, May 14, 2018 at 12:33 PM, INADA Naoki <songofacandy@gmail.com> wrote:
It will break hash randomization.
I'm not enough of a security expert to know how much that matters in this case, but I suppose one could do a bit of post-processing on the image to randomize the hashes? Or is that just insane?

Also -- I wasn't thinking it would be a pre-built binary blob that everyone used -- but one built on the fly on an individual system, maybe once per reboot, or once per shell instance even. So if you are running, e.g., hg a bunch of times in a shell, does it matter that the instances are all identical?

-CHB
On Tue, 15 May 2018 01:33:18 +0900 INADA Naoki <songofacandy@gmail.com> wrote:
It will break hash randomization.
I don't know why it would. The mechanism of pre-initializing a process which is re-used across many requests is how most server applications of Python already work (you don't want to bear the cost of spawning a new interpreter for each request, as antiquated CGI does). I have not heard that it breaks hash randomization, so a similar mechanism on the CLI side shouldn't break it either.

Regards

Antoine.
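As a concrete illustration of that pre-initializing pattern, here is a minimal, Unix-only sketch (the socket path and the trivial echo protocol are invented for the example): the daemon pays the heavy imports once, then forks a fresh child per CLI invocation, so every child starts from a warm interpreter but has its own address space:

    import os
    import socket

    SOCK_PATH = "/tmp/py-prefork.sock"  # hypothetical rendezvous point

    def serve():
        # Pay the expensive imports once, before any client connects.
        import argparse, json, re  # stand-ins for an app's heavy imports

        srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        if os.path.exists(SOCK_PATH):
            os.unlink(SOCK_PATH)
        srv.bind(SOCK_PATH)
        srv.listen(5)
        while True:
            conn, _ = srv.accept()
            pid = os.fork()
            if pid == 0:
                # Child: imports are already done; just run the command.
                args = conn.recv(4096).decode().split("\0")
                conn.sendall(("ran: %r\n" % (args,)).encode())
                conn.close()
                os._exit(0)
            conn.close()        # parent keeps listening
            os.waitpid(pid, 0)  # reap (a real server would do this async)

    if __name__ == "__main__":
        serve()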
I'm sorry, the word *will* may have been stronger than I intended. I meant that if a memory image dumped on disk is used casually, it may make it easier to open a security hole. For example, if an `hg` memory image is reused and it can leak in some way, hg serve will be vulnerable to hash-DoS. I don't deny that it's useful and safe when used carefully.

Regards,

On Tue, May 15, 2018 at 1:58 AM Antoine Pitrou <solipsis@pitrou.net> wrote:
On Tue, 15 May 2018 01:33:18 +0900 INADA Naoki <songofacandy@gmail.com> wrote:
It will break hash randomization.
I don't know why it would. The mechanism of pre-initializing a process which is re-used across many requests is how most server applications of Python already work (you don't want to bear the cost of spawning a new interpreter for each request, as antiquated CGI does). I have not heard that it breaks hash randomization, so a similar mechanism on the CLI side shouldn't break it either.
Regards
Antoine.
--
INADA Naoki <songofacandy@gmail.com>
On 14/05/2018 19:12, INADA Naoki wrote:
I'm sorry, the word *will* may have been stronger than I intended.
I meant that if a memory image dumped on disk is used casually, it may make it easier to open a security hole.
For example, if an `hg` memory image is reused and it can leak in some way, hg serve will be vulnerable to hash-DoS.
This discussion subthread is not about having a memory image dumped on disk, but a daemon utility that preloads a new Python process when you first start up your CLI application. Each time a new process is preloaded, it will by construction use a new hash seed.

(By contrast, the Node.js CVE issue you linked to is about having the same hash seed across a Node.js version; that's disastrous.)

Also, you can add a reuse limit to ensure that the hash seed is rotated (e.g. every 100 invocations).

Regards

Antoine.
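A runnable toy of that reuse limit (the echo worker and the numbers are invented for the demo): restart the preloaded worker every MAX_REUSE commands so each generation picks a fresh hash seed. The printed hash('a') stays stable within a generation and changes after the respawn, assuming PYTHONHASHSEED isn't pinned in the environment:

    import subprocess
    import sys

    MAX_REUSE = 3  # e.g. 100 in practice; tiny here to show the rotation

    def new_worker():
        # A fresh interpreter means a fresh hash randomization seed.
        return subprocess.Popen(
            [sys.executable, "-u", "-c",
             "import sys\n"
             "for line in sys.stdin: print(hash(line.strip()))"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

    worker, used = new_worker(), 0
    for cmd in ["a"] * 5:
        if used >= MAX_REUSE:
            worker.stdin.close()
            worker.wait()
            worker, used = new_worker(), 0
        worker.stdin.write(cmd + "\n")
        worker.stdin.flush()
        print(worker.stdout.readline().strip())
        used += 1
    worker.stdin.close()
    worker.wait()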
On Tue, May 15, 2018 at 2:17 AM, Antoine Pitrou <antoine@python.org> wrote:
On 14/05/2018 19:12, INADA Naoki wrote:
I'm sorry, the word *will* may have been stronger than I intended.
I meant that if a memory image dumped on disk is used casually, it may make it easier to open a security hole.
For example, if an `hg` memory image is reused and it can leak in some way, hg serve will be vulnerable to hash-DoS.
This discussion subthread is not about having a memory image dumped on disk, but a daemon utility that preloads a new Python process when you first start up your CLI application. Each time a new process is preloaded, it will by construction use a new hash seed.
My reply was to:
capture the entire runtime image as a single binary blob. could that blob be simply loaded into memory and run?
So I was thinking of a memory image being reused a nondeterministic number of times. Of course, prefork is much safer because the hash initial vector lives only in process RAM.

Regards,
On Mon, May 14, 2018 at 12:26:19PM -0400, Chris Barker via Python-Dev <python-dev@python.org> wrote:
With regard to forking -- is there another way? I don't have the expertise to have any idea if this is possible, but:
start up python
capture the entire runtime image as a single binary blob. could that blob be simply loaded into memory and run?
Like emacs unexec? https://www.google.com/search?q=emacs+unexec
-CHB
Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
On 14.05.2018 18:26, Chris Barker via Python-Dev wrote:
On Fri, May 11, 2018 at 11:05 AM, Ryan Gonzalez <rymg19@gmail.com <mailto:rymg19@gmail.com>> wrote:
<plug> https://refi64.com/uprocd/ </plug>
very cool -- but *nix only, of course :-(
But it seems that there is a demand for this sort of thing, and a few major projects are rolling their own. So maybe it makes sense to put something into the standard library that everyone could contribute to and use.
With regard to forking -- is there another way? I don't have the expertise to have any idea if this is possible, but:
start up python
capture the entire runtime image as a single binary blob.
could that blob be simply loaded into memory and run?
(hmm -- probably not -- memory addresses would be hard-coded then, yes?) or is memory virtualized enough these days?
You might want to look into combining this with PyRun:

https://www.egenix.com/products/python/PyRun/

which takes care of mmap'ing the byte code of the stdlib into memory.

--
Marc-Andre Lemburg
eGenix.com Professional Python Services directly from the Experts
Python Projects, Coaching and Consulting ... http://www.egenix.com/ Python Database Interfaces ... http://products.egenix.com/ Plone/Zope Database Interfaces ... http://zope.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/
On Fri, May 11, 2018 at 07:38:05AM -0700, Chris Barker - NOAA Federal via Python-Dev <python-dev@python.org> wrote:
Could one make a little startup utility that, when invoked the first time, starts up a raw python interpreter, keeps it running somewhere, and then forks it to run the actual python code.
Then every invocation after that would make a new fork.
Used to be implemented (and discussed on this list) many times. Just a few examples:

http://readyexec.sourceforge.net/
https://blogs.gnome.org/johan/2007/01/18/introducing-python-launcher/

Proven to be hard; it never gained any traction:

a) you don't want the daemon to import all possible modules, so you need to run a separate copy of the daemon for every Python version, every user and every client program;
b) you need to find "your" daemon - using TCP? unix sockets? named pipes? (one possible naming scheme is sketched just after this message);
c) you need to redirect stdio to/from the daemon;
d) you need to redirect signals and exceptions;
e) there are problems with elevated privileges (how do you elevate the daemon if the client was started with `sudo -H`?);
f) it is not portable (there is a popular GUI that cannot fork).
-CHB Sent from my iPhone
Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
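On point (b) of Oleg's list, one plausible scheme (invented for illustration; not what readyexec or python-launcher actually did) is to derive the daemon's socket path from the interpreter and user, so each Python version/user pair rendezvous with its own daemon:

    import getpass
    import hashlib
    import sys
    import tempfile

    def daemon_socket_path(program):
        # Mix the program name, interpreter path and version into the
        # rendezvous name so a daemon is never shared across Python
        # versions or users.
        key = "%s|%s|%d.%d" % (program, sys.executable,
                               sys.version_info[0], sys.version_info[1])
        digest = hashlib.sha256(key.encode()).hexdigest()[:12]
        return "%s/%s-%s-%s.sock" % (tempfile.gettempdir(), program,
                                     getpass.getuser(), digest)

    print(daemon_socket_path("hg"))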
Yes, you don't want this to be a generic utility, rather a helper library that people can integrate into their command-line applications to enable such startup caching.

Regards

Antoine.

On Fri, 11 May 2018 17:27:35 +0200 Oleg Broytman <phd@phdru.name> wrote:
On Fri, May 11, 2018 at 07:38:05AM -0700, Chris Barker - NOAA Federal via Python-Dev <python-dev@python.org> wrote:
Could one make a little startup utility that, when invoked the first time, starts up a raw python interpreter, keeps it running somewhere, and then forks it to run the actual python code.
Then every invocation after that would make a new fork.
Used to be implemented (and discussed on this list) many times. Just a few examples:

http://readyexec.sourceforge.net/
https://blogs.gnome.org/johan/2007/01/18/introducing-python-launcher/

Proven to be hard; it never gained any traction:

a) you don't want the daemon to import all possible modules, so you need to run a separate copy of the daemon for every Python version, every user and every client program;
b) you need to find "your" daemon - using TCP? unix sockets? named pipes?
c) you need to redirect stdio to/from the daemon;
d) you need to redirect signals and exceptions;
e) there are problems with elevated privileges (how do you elevate the daemon if the client was started with `sudo -H`?);
f) it is not portable (there is a popular GUI that cannot fork).
-CHB Sent from my iPhone
Oleg.
Indeed, we have an implementation of this specific to mypy.

On Fri, May 11, 2018 at 11:34 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Yes, you don't want this to be a generic utility, rather a helper library that people can integrate into their command-line applications to enable such startup caching.
Regards
Antoine.
On Fri, 11 May 2018 17:27:35 +0200 Oleg Broytman <phd@phdru.name> wrote:
On Fri, May 11, 2018 at 07:38:05AM -0700, Chris Barker - NOAA Federal via Python-Dev <python-dev@python.org> wrote:
Could one make a little startup utility that, when invoked the first time, starts up a raw python interpreter, keeps it running somewhere, and then forks it to run the actual python code.
Then every invocation after that would make a new fork.
Used to be implemented (and discussed on this list) many times. Just a few examples:

http://readyexec.sourceforge.net/
https://blogs.gnome.org/johan/2007/01/18/introducing-python-launcher/

Proven to be hard; it never gained any traction:

a) you don't want the daemon to import all possible modules, so you need to run a separate copy of the daemon for every Python version, every user and every client program;
b) you need to find "your" daemon - using TCP? unix sockets? named pipes?
c) you need to redirect stdio to/from the daemon;
d) you need to redirect signals and exceptions;
e) there are problems with elevated privileges (how do you elevate the daemon if the client was started with `sudo -H`?);
f) it is not portable (there is a popular GUI that cannot fork).
-CHB Sent from my iPhone
Oleg.
--
--Guido van Rossum (python.org/~guido)
On Fri, May 11, 2018 at 11:57 PM, Barry Warsaw <barry@python.org> wrote:
On May 11, 2018, at 12:23, Guido van Rossum <guido@python.org> wrote:
Indeed, we have an implementation of this specific to mypy.
Is there anything in mypy’s implementation that can be generalized into a library?
Not sure, here's the code:

https://github.com/python/mypy/blob/master/mypy/dmypy.py
https://github.com/python/mypy/blob/master/mypy/dmypy_server.py

(also dmypy_util.py there)

--
--Guido van Rossum (python.org/~guido)
On 5/1/2018 8:26 PM, Gregory Szorc wrote:
On 7/19/2017 12:15 PM, Larry Hastings wrote:
On 07/19/2017 05:59 AM, Victor Stinner wrote:
Mercurial startup time is already 45.8x slower than Git whereas tested Mercurial runs on Python 2.7.12. Now try to sell Python 3 to Mercurial developers, with a startup time 2x - 3x slower...
When Matt Mackall spoke at the Python Language Summit some years back, I recall that he specifically complained about Python startup time. He said Python 3 "didn't solve any problems for [them]"--they'd already solved their Unicode hygiene problems--and that Python's slow startup time was already a big problem for them. Python 3 being /even slower/ to start was absolutely one of the reasons why they didn't want to upgrade.
You might think "what's a few milliseconds matter". But if you run hundreds of commands in a shell script it adds up. git's speed is one of the few bright spots in its UX, and hg's comparative slowness here is a palpable disadvantage.
So please continue efforts for make Python startup even faster to beat all other programming languages, and finally convince Mercurial to upgrade ;-)
I believe Mercurial is, finally, slowly porting to Python 3.
https://www.mercurial-scm.org/wiki/Python3
Nevertheless, I can't really be annoyed or upset at them moving slowly to adopt Python 3, as Matt's objections were entirely legitimate.
I just now found this thread when searching the archive for threads about startup time. And I was searching for threads about startup time because Mercurial's startup time has been getting slower over the past few months and this is causing substantial pain.
As I posted back in 2014 [1], CPython's startup overhead was >10% of the total CPU time in Mercurial's test suite. And when you factor in the time to import modules that get Mercurial to a point where it can run commands, it was more like 30%!
Mercurial's full test suite currently runs `hg` ~25,000 times. Using Victor's startup time numbers of 6.4ms for 2.7 and 14.5ms for 3.7/master, Python startup overhead contributes ~160s on 2.7 and ~360s on 3.7/master. Even if you divide this by the number of available CPU cores, we're talking dozens of seconds of wall time just waiting for CPython to get to a place where Mercurial's first bytecode can execute.
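The arithmetic checks out (startup numbers from Victor's measurements at the top of the thread):

    invocations = 25000
    print(invocations * 6.4e-3)   # 160.0 -> ~160s of pure startup on 2.7
    print(invocations * 14.5e-3)  # 362.5 -> ~360s on 3.7/master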
And the problem is worse when you factor in the time it takes to import Mercurial's own modules.
As a concrete example, I recently landed a Mercurial patch [2] that stubs out zope.interface to prevent the import of 9 modules on every `hg` invocation. This "only" saved ~6.94ms for a typical `hg` invocation. But this decreased the CPU time required to run the test suite on my i7-6700K from ~4450s to ~3980s (~89.5% of original) - a reduction of almost 8 minutes of CPU time (and over 1 minute of wall time)!
By the time CPython gets Mercurial to a point where we can run useful code, we've already used up most of, or blown past, the time budget within which humans perceive an action/command as instantaneous. If you ignore startup overhead, Mercurial's performance compares quite well to Git's for many operations. But the reality is that CPython startup overhead makes it look like Mercurial is non-instantaneous before Mercurial even has the opportunity to execute meaningful code!
Mercurial provides a `chg` program that essentially spins up a daemon `hg` process running a "command server" so the `chg` program [written in C - no startup overhead] can dispatch commands to an already-running Python/`hg` process and avoid paying the startup overhead cost. When you run Mercurial's test suite using `chg`, it completes *minutes* faster. `chg` exists mainly as a workaround for slow startup overhead.
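The client half of that pattern can stay tiny, which is why chg can afford to be a small C program with no startup overhead. A Python toy matching the server sketch earlier in this thread (same invented socket path and protocol; not chg's real wire protocol):

    import socket
    import sys

    c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    c.connect("/tmp/py-prefork.sock")  # path assumed by the server sketch
    c.sendall("\0".join(sys.argv[1:]).encode())
    print(c.recv(4096).decode(), end="")
    c.close()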
Changing gears, my day job is maintaining Firefox's build system. We use Python heavily in the build system. And again, Python startup overhead is problematic. I don't have numbers offhand, but we likely invoke a few hundred Python processes as part of building Firefox. It should be several thousand. But we've had to "hack" parts of the build system to "batch" certain build actions into single process invocations in order to avoid Python startup overhead. This undermines the ability of some build tools to formulate a reasonable understanding of the DAG, causes a bit of pain for build system developers, and makes it difficult to achieve "no-op" and fast incremental builds, because we've had to move DAG awareness out of the build backend and into Python, so we're always invoking certain Python processes. At some point, we'll likely replace Python code with Rust so the build system is more "pure" and easier to maintain and reason about.
I've seen posts in this thread and elsewhere in the CPython development universe that challenge whether milliseconds in startup time matter. Speaking as a Mercurial and Firefox build system developer, *milliseconds absolutely matter*. Going further, *fractions of milliseconds matter*. For Mercurial's test suite with its ~25,000 Python process invocations, 1ms translates to ~25s of CPU time. With 2.7, Mercurial can dispatch commands in ~50ms. When you load common extensions, it isn't uncommon to see process startup overhead of 100-150ms! A millisecond here. A millisecond there. Before you know it, we're talking *minutes* of CPU (and potentially wall) time in order to run Mercurial's test suite (or build Firefox, or ...).
From my perspective, Python process startup and module import overhead is a severe problem for Python. I don't say this lightly, but in my mind the problem causes me to question the viability of Python for popular use cases, such as CLI applications. When choosing a programming language, I want one that will scale as a project grows. Vanilla process overhead has Python starting off significantly slower than compiled code (or even Perl) and adding module import overhead into the mix makes Python slower and slower as projects grow. As someone who has to deal with this slowness on a daily basis, I can tell you that it is extremely frustrating and it does matter. I hope that the importance of the problem will be acknowledged (milliseconds *do* matter) and that creative minds will band together to address it. Since I am disproportionately impacted by this issue, if there's anything I can do to help, let me know.
We were debugging abysmally slow execution of Mercurial's test harness on macOS and we discovered a new wrinkle to the startup time problem. It appears that APFS acquires some shared locks/mutexes in the kernel when executing readdir() and other filesystem system calls. When you have several Python processes all starting at the same time, I/O attached to module importing (import.c:case_ok() by the looks of it for Python 2.7) becomes a stress test of sorts for this lock acquisition.

On my 6+6 core MacBook Pro, ~75% of overall system CPU is spent in the kernel when executing the test harness with 12 parallel tests. If we run the test harness with the persistent `chg` command server (which eliminates Python process startup overhead), wall execution time drops from ~37:43s to ~9:06s.

This problem of shared locks on read-only operations appears to be similar to that of AUFS, which I've blogged about [1].

It is pretty common for non-compiled languages (like Python, Ruby, PHP, Perl, etc) to stat() the world as part of looking for modules to load. Typically, the filesystem's stat cache will save you and the overhead from hundreds or thousands of lookups is trivial (after first load). But it appears APFS is quite sensitive to it. Any work to reduce the number of filesystem API calls during Python startup will likely have a profound impact on APFS when multiple Python processes are starting. A "frozen" application where modules are in a shared container file is probably ideal.

Python 3.7 doesn't exhibit as much of a problem. But it is still there. A brief audit of the importer code and call stacks confirms it is the same problem - just less prevalent. Wall time execution of the test harness from Python 2.7 to Python 3.7 drops from ~37:43s to ~20:39. Overall kernel CPU time drops from ~75% to ~19%. And that wall time improvement is despite Python 3's slower process startup. So locking in the kernel is really a killer on Python 2.7.

While we're here, CPython might want to look into getdirentriesattr() as a replacement for readdir(). We switched to it in Mercurial several years ago to make `hg status` operations significantly faster [2]. I'm not sure if it will yield a speedup on APFS though. But it's worth a try. (If it does, you could probably make os.listdir()/os.scandir()/os.walk() significantly faster on macOS.)

I hope someone finds this information useful to further improving [startup] performance. (And given that Python 3.7 is substantially faster by avoiding excessive readdir(), I wouldn't be surprised if this problem is already known!)

[1] https://gregoryszorc.com/blog/2017/12/08/good-riddance-to-aufs/
[2] https://www.mercurial-scm.org/repo/hg/rev/05ccfe6763f1
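(A quick way to observe this startup filesystem traffic on Linux is "strace -c -e trace=file python3 -c pass", which tallies file-related syscalls for a bare interpreter start; on macOS, fs_usage or dtruss are the closest equivalents.)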
Hi,

On Tue, 9 Oct 2018 14:02:02 -0700 Gregory Szorc <gregory.szorc@gmail.com> wrote:
Python 3.7 doesn't exhibit as much of a problem. But it is still there. A brief audit of the importer code and call stacks confirms it is the same problem - just less prevalent. Wall time execution of the test harness from Python 2.7 to Python 3.7 drops from ~37:43s to ~20:39. Overall kernel CPU time drops from ~75% to ~19%. And that wall time improvement is despite Python 3's slower process startup. So locking in the kernel is really a killer on Python 2.7.
Thanks for the detailed feedback.
I hope someone finds this information useful to further improving [startup] performance. (And given that Python 3.7 is substantially faster by avoiding excessive readdir(), I wouldn't be surprised if this problem is already known!)
The macOS problem wasn't known, but the general problem of filesystem calls was (in relation to e.g. networked filesystems). Significant work went into improving Python 3 in that regard after the import mechanism was rewritten in pure Python.

Nowadays Python caches the contents of all sys.path directories, so (once the cache is primed) it's mostly a single stat() call per directory to check whether the cache is up-to-date. This is not entirely free, but massively better than what Python 2 did, which was to stat() many filename patterns in each sys.path directory.

(Of course, the fact that Python 3 imports many more modules at startup mitigates the end result a bit.)

As a side note, I was always shocked by how the Mercurial test suite is architected. You're wasting so much time launching processes that I wonder why you kept it that way for so long :-)

Regards

Antoine.
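That cache is visible from a running interpreter; a small illustration (the exact finder types are CPython implementation details):

    import importlib
    import sys

    # One cached finder per sys.path entry; each finder keeps a directory
    # listing that is revalidated with a single stat() of the directory.
    for path, finder in list(sys.path_importer_cache.items())[:3]:
        print(path, "->", finder)

    # Files created behind the import system's back need an explicit flush:
    importlib.invalidate_caches()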
On 9 Oct 2018, at 23:02, Gregory Szorc <gregory.szorc@gmail.com> wrote:
While we're here, CPython might want to look into getdirentriesattr() as a replacement for readdir(). We switched to it in Mercurial several years ago to make `hg status` operations significantly faster [2]. I'm not sure if it will yield a speedup on APFS though. But it's worth a try. (If it does, you could probably make os.listdir()/os.scandir()/os.walk() significantly faster on macOS.)
Note that getdirentriesattr is deprecated as of macOS 10.10; getattrlistbulk is the non-deprecated replacement (introduced in 10.10). Ronald
Hi,

I applied the patch below to count the number of times that Python is run. Running the Python test suite with "./python -m test -j0 -rW" runs Python 2,256 times. Honestly, I expected more.

I'm running tests with Python compiled in debug mode. And in debug mode, Python startup time is much worse:

haypo@selma$ python3 -m perf command --inherit=PYTHONPATH -v -- ./python -c pass
command: Mean +- std dev: 46.4 ms +- 2.3 ms

FYI I'm using gcc -O0 rather than -Og to make compilation even faster.

Victor

diff --git a/Lib/site.py b/Lib/site.py
index 7dc1b04..4b0c167 100644
--- a/Lib/site.py
+++ b/Lib/site.py
@@ -540,6 +540,21 @@ def execusercustomize():
                           (err.__class__.__name__, err))
 
 
+def run_counter():
+    import fcntl
+
+    fd = os.open("/home/haypo/prog/python/master/run_counter",
+                 os.O_WRONLY | os.O_CREAT | os.O_APPEND)
+    try:
+        fcntl.flock(fd, fcntl.LOCK_EX)
+        try:
+            os.write(fd, b'\x01')
+        finally:
+            fcntl.flock(fd, fcntl.LOCK_UN)
+    finally:
+        os.close(fd)
+
+
 def main():
     """Add standard site-specific directories to the module search path.
 
@@ -568,6 +583,7 @@ def main():
         execsitecustomize()
     if ENABLE_USER_SITE:
         execusercustomize()
+        run_counter()
 
     # Prevent extending of sys.path when python was started with -S and
     # site is imported later.
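Since the patch appends exactly one byte per run, reading the counter back is just a file-size check, e.g.:

    import os
    print(os.path.getsize("/home/haypo/prog/python/master/run_counter"))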
participants (42)

- Alex Walters
- Antoine Pitrou
- Antoine Pitrou
- Barry Warsaw
- Ben Hoyt
- Benjamin Peterson
- Brett Cannon
- Cesare Di Mauro
- Chris Angelico
- Chris Barker
- Chris Barker - NOAA Federal
- Chris Jerdonek
- David Mertz
- Glenn Linderman
- Gregory P. Smith
- Gregory Szorc
- Guido van Rossum
- Guido van Rossum
- INADA Naoki
- Ivan Levkivskyi
- Larry Hastings
- Lukasz Langa
- M.-A. Lemburg
- Michel Desmoulin
- Nathaniel Smith
- Ned Deily
- Neil Schemenauer
- Nick Coghlan
- Nikolaus Rath
- Oleg Broytman
- Paul Moore
- Ray Donnelly
- Ronald Oussoren
- Ryan Gonzalez
- Skip Montanaro
- Stefan Behnel
- Steve Dower
- Steven D'Aprano
- Terry Reedy
- Victor Stinner
- Victor Stinner
- Zero Piraeus