
On 21 July 2017 at 15:30, Cesare Di Mauro <cesare.di.mauro@gmail.com> wrote:
2017-07-21 4:52 GMT+02:00 Nick Coghlan <ncoghlan@gmail.com>:
On 21 July 2017 at 12:44, Nick Coghlan <ncoghlan@gmail.com> wrote:
We can separately measure the cost of unmarshalling the code object:
$ python3 -m perf timeit -s "import typing; from marshal import loads; from importlib.util import cache_from_source; cache = cache_from_source(typing.__file__); data = open(cache, 'rb').read()[12:]" "loads(data)" ..................... Mean +- std dev: 286 us +- 4 us
Slight adjustment here, as the cost of locating the cached bytecode and reading it from disk should really be accounted for in each iteration:
$ python3 -m perf timeit -s "import typing; from marshal import loads; from importlib.util import cache_from_source" "cache = cache_from_source(typing.__spec__.origin); data = open(cache, 'rb').read()[12:]; loads(data)" ..................... Mean +- std dev: 337 us +- 8 us
That will have a bigger impact when loading from spinning disk or a network drive, but it's fairly negligible when loading from a local SSD or an already primed filesystem cache.
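Written out as a plain script rather than a perf one-liner, the timed statement above does roughly the following (the 12-byte slice assumes the .pyc header size used by the CPython 3.6 series; later versions use a larger header):

from importlib.util import cache_from_source
from marshal import loads
import typing

# Locate the cached bytecode for an already-imported module, read it from
# disk, and unmarshal the stored code object.
cache = cache_from_source(typing.__spec__.origin)
with open(cache, "rb") as f:
    data = f.read()[12:]  # skip the .pyc header
code = loads(data)
print(type(code).__name__)  # -> 'code'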
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Thanks for your tests, Nick. It's quite evident that the marshal code cannot improve the situation, so I withdraw my proposal.
It was still a good suggestion, since it made me realise I *hadn't* actually measured the relative timings lately, so it was technically an untested assumption that module level code execution still dominated the overall import time.

typing is also a particularly large & complex module, and bytecode unmarshalling represents a larger fraction of the import time for simpler modules like abc:

$ python3 -m perf timeit -s "import abc; from marshal import loads; from importlib.util import cache_from_source" "cache = cache_from_source(abc.__spec__.origin); data = open(cache, 'rb').read()[12:]; loads(data)"
.....................
Mean +- std dev: 45.2 us +- 1.1 us

$ python3 -m perf timeit -s "import abc; loader_exec = abc.__spec__.loader.exec_module" "loader_exec(abc)"
.....................
Mean +- std dev: 172 us +- 5 us

$ python3 -m perf timeit -s "import abc; from importlib import reload" "reload(abc)"
.....................
Mean +- std dev: 280 us +- 14 us

And _weakrefset:

$ python3 -m perf timeit -s "import _weakrefset; from marshal import loads; from importlib.util import cache_from_source" "cache = cache_from_source(_weakrefset.__spec__.origin); data = open(cache, 'rb').read()[12:]; loads(data)"
.....................
Mean +- std dev: 57.7 us +- 1.3 us

$ python3 -m perf timeit -s "import _weakrefset; loader_exec = _weakrefset.__spec__.loader.exec_module" "loader_exec(_weakrefset)"
.....................
Mean +- std dev: 129 us +- 6 us

$ python3 -m perf timeit -s "import _weakrefset; from importlib import reload" "reload(_weakrefset)"
.....................
Mean +- std dev: 226 us +- 4 us

The conclusion still holds (the absolute numbers here are likely still too small for the extra complexity of parallelising bytecode loading to pay off in any significant way), but it also helps us set reasonable expectations around how much of a gain we're likely to be able to get just from precompilation with Cython.

That does actually raise a small microbenchmarking problem: for source and bytecode imports, we can force the import system to genuinely rerun the module or unmarshal the bytecode inside a single Python process, allowing perf to measure it independently of CPython startup. While I'm pretty sure it's possible to trick the import machinery into rerunning module level init functions even for old-style extension modules (hence allowing us to run similar tests to those above for a Cython compiled module), I don't actually remember how to do it off the top of my head.

Cheers,
Nick.

P.S. I'll also note that in these cases where the import overhead is proportionally significant for always-imported modules, we may want to look at the benefits of freezing them (if they otherwise remain as pure Python modules), or compiling them as builtin modules (if we switch them over to Cython), in addition to looking at ways to make the modules themselves faster. Being built directly into the interpreter binary is pretty much the best case scenario for reducing import overhead.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
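As a rough illustration of that last point about builtin and frozen modules, the import machinery can report which category a module currently falls into on a given interpreter (the module names below are just examples):

import sys
from importlib.machinery import BuiltinImporter, FrozenImporter

# Modules compiled directly into the interpreter binary -- the best case
# for import overhead, since no filesystem access or unmarshalling is needed.
print(sorted(sys.builtin_module_names))

# Classify a few example modules by how the import system would find them.
for name in ("abc", "_weakrefset", "sys", "_frozen_importlib"):
    if BuiltinImporter.find_spec(name):
        kind = "builtin"
    elif FrozenImporter.find_spec(name):
        kind = "frozen"
    else:
        kind = "source/bytecode"
    print(name, "->", kind)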