How implausible is it to write out the actual memory image of a loaded Python process? I.e. on a specific machine, OS, Python version, etc? This can only be overhead initially, of course, but on subsequent runs it's just one memory map, which is the cheapest possible operation.


$ python3.7 --write-image "import typing, re, os, numpy"

I imagine this creating a file like:


Then just terminating as if just that line had run, however long it takes (but snapshotting before exit).

Then subsequent invocations would only restore the image to memory. Maybe:

$ pyrunner --load-image python37-typing-re-os-numpy

The last line could be aliased, of course. I suppose we'd need to check whether the relevant image file exists, and if not, fall back to ignoring the '--load-image' flag and running plain old Python.
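
As a rough sketch of that fallback, and nothing more: `pyrunner` and `--load-image` are hypothetical, and the image directory and the `build_command` helper below are names I made up purely for illustration.

```python
import os
import sys

# Assumed location for saved images; nothing like this exists today.
IMAGE_DIR = os.path.expanduser("~/.cache/pyimages")


def build_command(image_name, script_args, image_dir=IMAGE_DIR):
    """Return the argv to run, preferring a saved image when one exists."""
    image_path = os.path.join(image_dir, image_name)
    if os.path.isfile(image_path):
        # Image is present: hand it to the (hypothetical) restoring launcher.
        return ["pyrunner", "--load-image", image_path, *script_args]
    # No image: ignore the request and run the ordinary interpreter.
    return [sys.executable, *script_args]
```

A shell alias would then only need to exec whatever `build_command("python37-typing-re-os-numpy", sys.argv[1:])` returns.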

This helps not at all for something like AWS Lambda where each instance is spun up fresh. But for the use-case of running many Python shell commands at an interactive shell on one machine, it seems like that could be very fast.

In my hypothetical I'm supposing we pre-load some collection of modules into the image. Of course, the script may need to load others, and it may not use some of those in the image. But users could decide for themselves which modules they typically need.

On Jul 20, 2017 11:27 PM, "Nick Coghlan" wrote:
On 21 July 2017 at 15:30, Cesare Di Mauro wrote:

2017-07-21 4:52 GMT+02:00 Nick Coghlan:
On 21 July 2017 at 12:44, Nick Coghlan wrote:
> We can separately measure the cost of unmarshalling the code object:
> $ python3 -m perf timeit -s "import typing; from marshal import loads; from
> importlib.util import cache_from_source; cache =
> cache_from_source(typing.__file__); data = open(cache, 'rb').read()[12:]"
> "loads(data)"
> .....................
> Mean +- std dev: 286 us +- 4 us

Slight adjustment here, as the cost of locating the cached bytecode
and reading it from disk should really be accounted for in each
iteration:

$ python3 -m perf timeit -s "import typing; from marshal import loads;
from importlib.util import cache_from_source" "cache =
cache_from_source(typing.__spec__.origin); data = open(cache,
'rb').read()[12:]; loads(data)"
Mean +- std dev: 337 us +- 8 us

That will have a bigger impact when loading from spinning disk or a
network drive, but it's fairly negligible when loading from a local
SSD or an already primed filesystem cache.
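
For anyone rerunning this on a newer interpreter: the `[12:]` slice assumes the 12-byte .pyc header used up to Python 3.6, whereas from 3.7 onwards (PEP 552) the header is 16 bytes. A minimal sketch of the same measurement using the stdlib `timeit` module instead of perf follows; the helper names are mine, and it assumes the cached bytecode file actually exists on disk.

```python
import sys
import timeit
from importlib.util import cache_from_source
from marshal import loads

# .pyc header: 12 bytes up to Python 3.6, 16 bytes from 3.7 (PEP 552).
HEADER_SIZE = 16 if sys.version_info >= (3, 7) else 12


def load_cached_code(module):
    """Locate the module's cached bytecode, read it, and unmarshal it."""
    cache = cache_from_source(module.__spec__.origin)
    with open(cache, "rb") as f:
        data = f.read()[HEADER_SIZE:]
    return loads(data)


def mean_usec(func, number=200):
    """Mean wall-clock time of func() in microseconds over `number` calls."""
    return timeit.timeit(func, number=number) / number * 1e6
```

For example, after `import typing`, `mean_usec(lambda: load_cached_code(typing))` gives a figure comparable to the second measurement above, though without perf's calibration and outlier handling.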


Nick Coghlan   |   Brisbane, Australia
Thanks for your tests, Nick. It's quite evident that the marshal code cannot improve the situation, so I withdraw my proposal.

It was still a good suggestion, since it made me realise I *hadn't* actually measured the relative timings lately, so it was technically an untested assumption that module level code execution still dominated the overall import time.

typing is also a particularly large & complex module, and bytecode unmarshalling represents a larger fraction of the import time for simpler modules like abc:

$ python3 -m perf timeit -s "import abc; from marshal import loads; from importlib.util import cache_from_source" "cache = cache_from_source(abc.__spec__.origin); data = open(cache, 'rb').read()[12:]; loads(data)"
Mean +- std dev: 45.2 us +- 1.1 us

$ python3 -m perf timeit -s "import abc; loader_exec = abc.__spec__.loader.exec_module" "loader_exec(abc)"
Mean +- std dev: 172 us +- 5 us

$ python3 -m perf timeit -s "import abc; from importlib import reload" "reload(abc)"
Mean +- std dev: 280 us +- 14 us

And _weakrefset:

$ python3 -m perf timeit -s "import _weakrefset; from marshal import loads; from importlib.util import cache_from_source" "cache = cache_from_source(_weakrefset.__spec__.origin); data = open(cache, 'rb').read()[12:]; loads(data)"
Mean +- std dev: 57.7 us +- 1.3 us

$ python3 -m perf timeit -s "import _weakrefset; loader_exec = _weakrefset.__spec__.loader.exec_module" "loader_exec(_weakrefset)"
Mean +- std dev: 129 us +- 6 us

$ python3 -m perf timeit -s "import _weakrefset; from importlib import reload" "reload(_weakrefset)"
Mean +- std dev: 226 us +- 4 us
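
The exec_module and reload measurements can be repeated for an arbitrary module with plain `timeit`. This is only a rough sketch (the `rerun_timings` helper is a name I've invented, and `timeit` is less careful than perf), but it shows the shape of the comparison:

```python
import importlib
import timeit


def rerun_timings(module, number=100):
    """Return mean microseconds for exec_module() and reload() on `module`."""
    exec_module = module.__spec__.loader.exec_module
    # Re-run only the module-level code in the existing namespace.
    exec_s = timeit.timeit(lambda: exec_module(module), number=number)
    # Go through the import system again: find the spec, load, execute.
    reload_s = timeit.timeit(lambda: importlib.reload(module), number=number)
    to_usec = 1e6 / number
    return {"exec_module": exec_s * to_usec, "reload": reload_s * to_usec}
```

The gap between the two figures is roughly what the import machinery adds on top of executing the module body.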

The conclusion still holds (the absolute numbers here are likely still too small for the extra complexity of parallelising bytecode loading to pay off in any significant way), but it also helps us set reasonable expectations around how much of a gain we're likely to be able to get just from precompilation with Cython.

That does actually raise a small microbenchmarking problem: for source and bytecode imports, we can force the import system to genuinely rerun the module or unmarshal the bytecode inside a single Python process, allowing perf to measure it independently of CPython startup. While I'm pretty sure it's possible to trick the import machinery into rerunning module level init functions even for old-style extension modules (hence allowing us to run similar tests to those above for a Cython compiled module), I don't actually remember how to do it off the top of my head.


P.S. I'll also note that in these cases where the import overhead is proportionally significant for always-imported modules, we may want to look at the benefits of freezing them (if they otherwise remain as pure Python modules), or compiling them as builtin modules (if we switch them over to Cython), in addition to looking at ways to make the modules themselves faster. Being built directly into the interpreter binary is pretty much the best case scenario for reducing import overhead.
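
To see which category a given module currently falls into, its import spec can be inspected. A small sketch, assuming the conventional `origin` strings ("built-in" and "frozen") that the import system sets; the `import_kind` helper and its return labels are my own:

```python
import importlib.util
from importlib.machinery import EXTENSION_SUFFIXES


def import_kind(module_name):
    """Classify how a top-level module is (or would be) loaded."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return "not found"
    if spec.origin == "built-in":
        return "built-in"  # compiled into the interpreter binary
    if spec.origin == "frozen":
        return "frozen"  # bytecode embedded in the interpreter binary
    if spec.origin and spec.origin.endswith(tuple(EXTENSION_SUFFIXES)):
        return "extension"  # separate shared library on disk
    return "source or bytecode"
```

For instance, `import_kind("sys")` reports "built-in", while the answer for modules such as abc depends on the Python version and build in use.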

Nick Coghlan   |   Brisbane, Australia

Python-Dev mailing list