[Python-ideas] Move optional data out of pyc files

Wed Apr 11 02:06:42 EDT 2018

On Wed, Apr 11, 2018 at 02:21:17PM +1000, Chris Angelico wrote:

[...]
> > Yes, it will double the number of files. Actually quadruple it, if the
> > annotations and line numbers are in separate files too. But if most of
> > those extra files never need to be opened, then there's no cost to them.
> > And whatever extra cost there is, is amortized over the lifetime of the
> > interpreter.
> 
> Yes, if they are actually not needed. My question was about whether
> that is truly valid.

We're never really going to know the effect on performance without 
implementing and benchmarking the code. It might turn out that, to our 
surprise, three quarters of the std lib relies on loading docstrings 
during startup. But I doubt it.

> Consider a very common use-case: an OS-provided
> Python interpreter whose files are all owned by 'root'. Those will be
> distributed with .pyc files for performance, but you don't want to
> deprive the users of help() and anything else that needs docstrings
> etc. So... are the docstrings lazily loaded or eagerly loaded?

What relevance is it that they're owned by root?

> If eagerly, you've doubled the number of file-open calls to initialize
> the interpreter.

I do not understand why you think this is even an option. Has Serhiy 
said something that I missed that makes this seem to be on the table? 
That's not a rhetorical question -- I may have missed something. But I'm 
sure he understands that doubling or quadrupling the number of file 
operations during startup is not an optimization.

> (Or quadrupled, if you need annotations and line
> numbers and they're all separate.) If lazily, things are a lot more
> complicated than the original description suggested, and there'd need
> to be some semantic changes here.

What semantic change do you expect?

There's an implementation change, of course, but that's Serhiy's problem 
to deal with and I'm sure that he has considered that. There should be 
no semantic change. When you access obj.__doc__, then and only then are 
the compiled docstrings for that module read from the disk.

I don't know the current implementation of .pyc files, but I like 
Antoine's suggestion of laying it out in four separate areas (plus 
header), each one marshalled:

    code
    docstrings
    annotations
    line numbers

Aside from code, which is mandatory, the three other sections could be 
None to represent "not available", as is the case when you pass -OO to 
the interpreter, or they could be some other sentinel that means "load 
lazily from the appropriate file", or they could be the marshalled data 
directly in place to support byte-code only libraries.
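
To make that layout concrete, here is a minimal sketch of four marshalled 
sections written back to back. It is only an illustration, not the real 
.pyc format: the header is omitted, the function names are made up, and 
using Ellipsis as the "load lazily" sentinel is purely an assumption 
(marshal happens to be able to serialise it).

```python
import io
import marshal

# Hypothetical sentinel meaning "load lazily from the appropriate file".
LAZY = Ellipsis

SECTION_NAMES = ("code", "docstrings", "annotations", "lines")


def write_sections(code, docstrings=None, annotations=None, lines=None):
    """Marshal the four sections one after another (no header)."""
    buf = io.BytesIO()
    for section in (code, docstrings, annotations, lines):
        marshal.dump(section, buf)
    return buf.getvalue()


def read_sections(data):
    """Read the four sections back in the same order."""
    buf = io.BytesIO(data)
    return {name: marshal.load(buf) for name in SECTION_NAMES}


code = compile("x = 1", "<demo>", "exec")
data = write_sections(
    code,
    docstrings={"f": "doc for f"},  # marshalled data directly in place
    annotations=None,               # "not available"
    lines=LAZY,                     # "go and look in the other file"
)
sections = read_sections(data)
```

Each optional section then takes one of the three forms described above: 
None, the sentinel, or the data itself.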

As for the in-memory data structures of objects themselves, I imagine 
something like the __doc__ and __annotations__ slots pointing to a table 
of strings, which is not initialised until you attempt to read from the 
table. Or something -- don't pay too much attention to my wild guesses.
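
In pure Python, that guess might look roughly like the sketch below. All 
the names (LazyDocTable, LazyDoc, the loader callable) are hypothetical, 
and a real implementation would live in C inside the interpreter rather 
than in a descriptor; the point is only that the table is not filled in 
until the first read of __doc__.

```python
class LazyDocTable:
    """Table of docstrings that is populated on the first lookup only."""

    def __init__(self, loader):
        self._loader = loader  # callable returning {name: docstring}
        self._table = None     # not initialised until first read
        self.loads = 0         # how many times the loader actually ran

    def get(self, name):
        if self._table is None:
            self._table = self._loader()
            self.loads += 1
        return self._table.get(name)


class LazyDoc:
    """Descriptor standing in for a __doc__ slot that points at the table."""

    def __init__(self, table, name):
        self._table = table
        self._name = name

    def __get__(self, obj, owner=None):
        return self._table.get(self._name)


def load_docstrings():
    # Stand-in for reading the compiled docstrings from disk.
    return {"Spam": "Spam, spam, spam."}


table = LazyDocTable(load_docstrings)


class Spam:
    __doc__ = LazyDoc(table, "Spam")


loads_before = table.loads      # nothing has been read yet
doc_from_class = Spam.__doc__   # first access triggers the load
doc_from_instance = Spam().__doc__
loads_after = table.loads       # still a single load
```

From the caller's point of view, Spam.__doc__ behaves exactly as it does 
today; only the moment at which the data is read has changed.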

The bottom line is, is there some reason *aside from performance* to 
avoid this? Because if the performance is worse, I'm sure Serhiy will be 
the first to dump this idea.

-- 
Steve