[Python-ideas] Move optional data out of pyc files

Steven D'Aprano steve at pearwood.info
Tue Apr 10 20:03:35 EDT 2018


On Wed, Apr 11, 2018 at 03:38:08AM +1000, Chris Angelico wrote:
> On Wed, Apr 11, 2018 at 2:14 AM, Serhiy Storchaka <storchaka at gmail.com> wrote:
> > Currently pyc files contain data that is useful mostly for developing and is
> > not needed in most normal cases in stable program. There is even an option
> > that allows to exclude a part of this information from pyc files. It is
> > expected that this saves memory, startup time, and disk space (or the time
> > of loading from network). I propose to move this data from pyc files into
> > separate file or files. pyc files should contain only external references to
> > external files. If the corresponding external file is absent or specific
> > option suppresses them, references are replaced with None or NULL at import
> > time, otherwise they are loaded from external files.
> >
> > 1. Docstrings. They are needed mainly for developing.
> >
> > 2. Line numbers (lnotab). They are helpful for formatting tracebacks, for
> > tracing, and debugging with the debugger. Sources are helpful in such cases
> > too. If the program doesn't contain errors ;-) and is shipped without
> > sources, they could be removed.
> >
> > 3. Annotations. They are used mainly by third party tools that statically
> > analyze sources. They are rarely used at runtime.
> >
> > Docstrings will be read from the corresponding docstring file unless -OO is
> > supplied. This will allow also to localize docstrings. Depending on locale
> > or other settings different docstring file can be used.
> >
> > For suppressing line numbers and annotations new options can be added.
> 
> A deployed Python distribution generally has .pyc files for all of the
> standard library. I don't think people want to lose the ability to
> call help(), and unless I'm misunderstanding, that requires
> docstrings. So this will mean twice as many files and twice as many
> file-open calls to import from the standard library. What will be the
> impact on startup time?

I shouldn't think that the number of files on disk is very important, 
now that they're hidden away in the __pycache__ directory where they can 
be ignored by humans. Even venerable old FAT32 has a limit of 65,534 
files in a single folder, and 268,435,437 on the entire volume. So 
unless the std lib expands to 16000+ modules, the number of files in the 
__pycache__ directory ought to be well below that limit.
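
As a rough check of that arithmetic, here is a minimal sketch. It assumes
the four-files-per-module layout discussed in this thread and takes the
FAT32 per-directory figure quoted above at face value. It ignores the fact
that long file names consume extra directory entries on FAT32, so the
practical ceiling would be lower.

```python
# Per-directory file limit for FAT32, as quoted in the paragraph above.
FAT32_MAX_FILES_PER_DIR = 65_534

# Assumed layout: one main .pyc plus -doc, -lno and -ann companion files.
FILES_PER_MODULE = 4

# Largest number of modules a single __pycache__ directory could hold.
max_modules = FAT32_MAX_FILES_PER_DIR // FILES_PER_MODULE
print(max_modules)  # 16383
```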

I think even MicroPython ought to be okay with that. (But it would be 
nice to find out for sure: does it support file systems with *really* 
tiny limits?)

The entire __pycache__ directory is supposed to be a black box except 
under unusual circumstances, so it doesn't matter (at least not to me)
if we have:

    __pycache__/spam.cpython-38.pyc

alone or:

    __pycache__/spam.cpython-38.pyc
    __pycache__/spam.cpython-38-doc.pyc
    __pycache__/spam.cpython-38-lno.pyc
    __pycache__/spam.cpython-38-ann.pyc

(say). And if the external references are loaded lazily, on demand,
rather than eagerly, this could save startup time, which I think is the
intention. The docstrings would still be available, just not loaded
until the first time you try to use them.
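
To make "loaded lazily" concrete, here is a minimal sketch of the idea,
not a proposal for the actual mechanism. The LazyDocs class, the JSON
side file and its name are all hypothetical stand-ins for whatever format
an external docstring file would really use.

```python
import json
import tempfile
from pathlib import Path


class LazyDocs:
    """Hypothetical holder that reads a side file only on first lookup."""

    def __init__(self, path):
        self._path = Path(path)
        self._docs = None    # nothing is read when the object is created
        self.load_count = 0  # counts file reads, for demonstration only

    def get(self, qualname):
        # Read and cache the whole side file on the first request.
        if self._docs is None:
            self._docs = json.loads(self._path.read_text(encoding="utf-8"))
            self.load_count += 1
        return self._docs.get(qualname)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for an external docstring file (name and format assumed).
    doc_file = Path(tmp) / "spam.cpython-38-doc.json"
    doc_file.write_text(
        json.dumps({"spam.eggs": "Return some eggs."}), encoding="utf-8"
    )

    docs = LazyDocs(doc_file)
    loads_before = docs.load_count  # still 0: "import" did no I/O
    first = docs.get("spam.eggs")   # triggers the one and only read
    second = docs.get("spam.eggs")  # served from the cache
    loads_after = docs.load_count
```

The cost of opening the extra file is paid only by code that actually asks
for a docstring, which is the case Chris raises with help().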

However, Python supports byte-code only distribution, using .pyc files
located outside the __pycache__ directory. In that case, it would be
annoying and inconvenient to distribute four top-level files, so I think
that the use of external references has to be optional. There has to be
either a way to compile to a single .pyc file containing all four parts,
or an external tool that can take the existing four files and merge them.
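
For reference, this is roughly how byte-code only distribution works
today with a single file, using the standard py_compile module. The module
name spam_demo and its contents are made up for the example. It shows only
the current behaviour, not the proposed split.

```python
import importlib
import os
import py_compile
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "spam_demo.py")
    with open(src, "w", encoding="utf-8") as f:
        f.write('def eggs():\n    """Return some eggs."""\n    return 42\n')

    # Write the byte code next to the source instead of into __pycache__.
    # optimize=0 keeps docstrings regardless of the interpreter's -O level.
    py_compile.compile(
        src, cfile=os.path.join(tmp, "spam_demo.pyc"), doraise=True, optimize=0
    )
    os.remove(src)  # distribute the .pyc only

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        spam_demo = importlib.import_module("spam_demo")
        result = spam_demo.eggs()
        doc = spam_demo.eggs.__doc__
        loaded_from = os.path.basename(spam_demo.__file__)
    finally:
        # Undo the changes made to the import system for the demo.
        sys.path.remove(tmp)
        sys.modules.pop("spam_demo", None)
```

Everything the module needs, docstring included, travels in that one file,
which is the convenience a four-file layout would have to preserve.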


-- 
Steve

