[Python-ideas] Move optional data out of pyc files

Steven D'Aprano steve at pearwood.info
Tue Apr 10 23:02:05 EDT 2018


On Wed, Apr 11, 2018 at 10:08:58AM +1000, Chris Angelico wrote:

> File system limits aren't usually an issue; as you say, even FAT32 can
> store a metric ton of files in a single directory. I'm more interested
> in how long it takes to open a file, and whether doubling that time
> will have a measurable impact on Python startup time. Part of that
> cost can be reduced by using openat(), on platforms that support it,
> but even with a directory handle, there's still a definite non-zero
> cost to opening and reading an additional file.

Yes, it will double the number of files. Actually quadruple it, if the 
annotations and line numbers are in separate files too. But if most of 
those extra files never need to be opened, then there's no cost to them. 
And whatever extra cost there is, is amortized over the lifetime of the 
interpreter.
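
As a rough illustration of the openat() point in the quoted text, here is a minimal sketch of reading several files through one shared directory handle. Python exposes this through the dir_fd argument of os.open() on platforms that support it. The helper name read_relative and the file names are made up for this example and are not part of any proposal.

```python
import os
import tempfile


def read_relative(dir_path, names):
    """Read several files from one directory, sharing a directory handle."""
    results = {}
    if os.open in os.supports_dir_fd:
        # Open the directory once, then open each file relative to it.
        dir_fd = os.open(dir_path, os.O_RDONLY)
        try:
            for name in names:
                fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
                with os.fdopen(fd, "rb") as f:
                    results[name] = f.read()
        finally:
            os.close(dir_fd)
    else:
        # Fallback for platforms without dir_fd support.
        for name in names:
            with open(os.path.join(dir_path, name), "rb") as f:
                results[name] = f.read()
    return results


def demo_read_relative():
    # Hypothetical file names, chosen only for illustration.
    with tempfile.TemporaryDirectory() as tmp:
        for name, payload in [("spam.pyc", b"code"), ("spam.doc", b"docs")]:
            with open(os.path.join(tmp, name), "wb") as f:
                f.write(payload)
        return read_relative(tmp, ["spam.pyc", "spam.doc"])
```

Even with the shared handle, each extra file still costs an open and a read, which is the non-zero cost Chris describes.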

The expectation here is that this could reduce startup time: the files 
which are read up front are smaller, so less data needs to be read and 
traverse the network, and loading the rest can be deferred until it is 
actually needed.

Serhiy is experienced enough that I think we should assume he's not 
going to push this optimization into production unless it actually does 
reduce startup time. He has proven himself enough that we should assume 
competence rather than incompetence :-)

Here is the proposal as I understand it:

- by default, change .pyc files to store annotations, docstrings
  and line numbers as references to external files which will be
  lazily loaded on demand;

- single-file .pyc files must still be supported, but this won't
  be the default and could rely on an external "merge" tool;

- objects that rely on docstrings or annotations, such as dataclass,
  may experience a (hopefully very small) increase of import time,
  since they may not be able to defer loading the extra files;

- but in general, most modules should (we expect) see a decrease
  in the load time;

- which will (we hope) reduce startup time;

- libraries which make eager use of docstrings and annotations might
  even ship with the single-file .pyc instead (the library installer
  can look after that aspect), and so avoid any extra cost.
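
The lazy loading described in the list above can be sketched roughly as follows. This is only an illustration of the general idea, not Serhiy's design: the class name LazySidecar, the ".docstrings" file name and the use of pickle as the side-file format are all assumptions made for this example.

```python
import os
import pickle
import tempfile


class LazySidecar:
    """Load a mapping from a side file the first time it is needed."""

    def __init__(self, path):
        self.path = path
        self._data = None
        self.loads = 0  # how many times the side file was actually read

    def get(self, key):
        if self._data is None:
            # First access: pay the cost of opening and reading the file.
            with open(self.path, "rb") as f:
                self._data = pickle.load(f)
            self.loads += 1
        return self._data.get(key)


def demo_lazy_sidecar():
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical side file holding the docstrings of module "spam".
        path = os.path.join(tmp, "spam.docstrings")
        with open(path, "wb") as f:
            pickle.dump({"spam": "Return some spam."}, f)

        docs = LazySidecar(path)
        before = docs.loads       # nothing has been read at "import" time
        text = docs.get("spam")   # first access triggers the read
        docs.get("spam")          # second access reuses the cached data
        return before, text, docs.loads
```

A module that never touches its docstrings never opens the side file at all, while something that reads them eagerly pays for the extra file on import.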

Naturally pushing this into production will require benchmarks that 
prove this actually does improve startup time. I believe that Serhiy's 
reason for asking is to determine whether it is worth his while to 
experiment on this. There's no point in implementing these changes and 
benchmarking them, if there's no chance of it being accepted.
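
For a first rough feel of the trade-off, something like the following micro-benchmark could compare reading one combined file against reading a smaller file with or without its extras. It is a sketch under stated assumptions: the file sizes are invented, the file names are hypothetical, and the files will sit in the OS cache, so it says nothing about cold or networked file systems. A real benchmark would have to measure actual interpreter startup.

```python
import os
import tempfile
import timeit


def time_read(paths, repeat=200):
    """Return the average time in seconds to open and read all paths."""
    def read_all():
        for p in paths:
            with open(p, "rb") as f:
                f.read()
    return timeit.timeit(read_all, number=repeat) / repeat


def compare(code_size=20_000, extra_size=30_000):
    # The sizes are made-up placeholders, not measurements of real .pyc files.
    with tempfile.TemporaryDirectory() as tmp:
        combined = os.path.join(tmp, "combined.pyc")
        code_only = os.path.join(tmp, "code.pyc")
        extras = os.path.join(tmp, "extras.bin")
        with open(combined, "wb") as f:
            f.write(os.urandom(code_size + extra_size))
        with open(code_only, "wb") as f:
            f.write(os.urandom(code_size))
        with open(extras, "wb") as f:
            f.write(os.urandom(extra_size))
        return {
            "combined": time_read([combined]),
            "code_only": time_read([code_only]),
            "code_plus_extras": time_read([code_only, extras]),
        }
```

Calling compare() returns the three average timings in a dict. The interesting comparison is "code_only" (extras never needed) against "combined", and "code_plus_extras" (extras needed after all) against "combined".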

So on the assumptions that:

- benchmarking does demonstrate a non-trivial speedup of
  interpreter startup;

- single-file .pyc files are still supported, for the use
  of byte-code only libraries;

- and modules which are particularly badly impacted by this
  change are able to opt out and use a single .pyc file;

I see no reason not to support this idea if Serhiy (or someone else) is 
willing to put in the work.


-- 
Steve

