[Python-ideas] Move optional data out of pyc files
Steven D'Aprano
steve at pearwood.info
Tue Apr 10 23:02:05 EDT 2018
On Wed, Apr 11, 2018 at 10:08:58AM +1000, Chris Angelico wrote:
> File system limits aren't usually an issue; as you say, even FAT32 can
> store a metric ton of files in a single directory. I'm more interested
> in how long it takes to open a file, and whether doubling that time
> will have a measurable impact on Python startup time. Part of that
> cost can be reduced by using openat(), on platforms that support it,
> but even with a directory handle, there's still a definite non-zero
> cost to opening and reading an additional file.
Yes, it will double the number of files. Actually quadruple it, if the
annotations and line numbers are in separate files too. But if most of
those extra files never need to be opened, then there's no cost to them.
And whatever extra cost there is, is amortized over the lifetime of the
interpreter.
The expectation here is that this could reduce startup time: the files
which are read are smaller, so less data needs to be read and to
traverse the network up front, and loading the rest can be deferred
until it is actually needed.
Serhiy is experienced enough that I think we should assume he's not
going to push this optimization into production unless it actually does
reduce startup time. He has proven himself enough that we should assume
competence rather than incompetence :-)
Here is the proposal as I understand it:
- by default, change .pyc files to store annotations, docstrings
  and line numbers as references to external files which will be
  lazily loaded on demand;
- single-file .pyc files must still be supported, but this won't
  be the default and could rely on an external "merge" tool;
- objects that rely on docstrings or annotations, such as dataclass,
  may experience a (hopefully very small) increase in import time,
  since they may not be able to defer loading the extra files;
- but in general, most modules should (we expect) see a decrease
  in load time;
- which will (we hope) reduce startup time;
- libraries which make eager use of docstrings and annotations might
  even ship with the single-file .pyc instead (the library installer
  can look after that aspect), and so avoid any extra cost.
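
The lazy-loading idea in the list above can be sketched roughly as
follows. This is only an illustration, not Serhiy's design: the
LazySidecar class, the JSON format and the "spam.docstrings.json" file
name are all hypothetical, chosen to show that the extra file is not
opened until something asks for the data, and is then read only once.

```python
import json
import os
import tempfile


class LazySidecar:
    """Hypothetical holder for optional data kept in a separate file."""

    def __init__(self, path):
        self.path = path
        self._data = None   # nothing has been read from disk yet
        self.reads = 0      # counts how often the file is actually opened

    def get(self, name):
        # Open and parse the sidecar file on first access only.
        if self._data is None:
            with open(self.path, encoding="utf-8") as f:
                self._data = json.load(f)
            self.reads += 1
        return self._data.get(name)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Assumed layout: one sidecar file of docstrings per module.
        path = os.path.join(tmp, "spam.docstrings.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"spam.eggs": "Return some eggs."}, f)

        docs = LazySidecar(path)
        reads_before = docs.reads       # no access yet, so no file read
        first = docs.get("spam.eggs")   # first access triggers the read
        docs.get("spam.eggs")           # second access uses the cache
        return reads_before, first, docs.reads
```

A module whose docstrings are never inspected would never pay for the
read at all; a module that needs them at import time pays for one extra
file open, which is the small cost mentioned for the dataclass case.
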
Naturally pushing this into production will require benchmarks that
prove this actually does improve startup time. I believe that Serhiy's
reason for asking is to determine whether it is worth his while to
experiment on this. There's no point in implementing these changes and
benchmarking them, if there's no chance of it being accepted.
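
As a rough idea of what such a benchmark could look like, here is a
minimal sketch that times how long a fresh interpreter takes to start
and exit. The startup_times helper is hypothetical, and wall-clock
timing of a subprocess is noisy, so this only shows the shape of the
measurement, not a rigorous methodology.

```python
import statistics
import subprocess
import sys
import time


def startup_times(args=("-c", "pass"), runs=5):
    """Return wall-clock times (seconds) for launching the interpreter."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Start a new interpreter, run a trivial program, wait for exit.
        subprocess.run([sys.executable, *args], check=True)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    samples = startup_times()
    print("median startup: %.4f s" % statistics.median(samples))
```

Comparing the median before and after a change, on both a local disk
and a network file system, would give a first indication. On Python 3.7
and later, "python -X importtime -c pass" additionally reports the time
spent on each import.
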
So on the assumptions that:
- benchmarking does demonstrate a non-trivial speedup of
  interpreter startup;
- single-file .pyc files are still supported, for the use
  of byte-code only libraries;
- and modules which are particularly badly impacted by this
  change are able to opt-out and use a single .pyc file;
I see no reason not to support this idea if Serhiy (or someone else) is
willing to put in the work.
--
Steve