[Python-ideas] Move optional data out of pyc files
Steven D'Aprano
steve at pearwood.info
Wed Apr 11 02:06:42 EDT 2018
On Wed, Apr 11, 2018 at 02:21:17PM +1000, Chris Angelico wrote:
[...]
> > Yes, it will double the number of files. Actually quadruple it, if the
> > annotations and line numbers are in separate files too. But if most of
> > those extra files never need to be opened, then there's no cost to them.
> > And whatever extra cost there is, is amortized over the lifetime of the
> > interpreter.
>
> Yes, if they are actually not needed. My question was about whether
> that is truly valid.
We're never really going to know the effect on performance without
implementing and benchmarking the code. It might turn out that, to our
surprise, three quarters of the std lib relies on loading docstrings
during startup. But I doubt it.
> Consider a very common use-case: an OS-provided
> Python interpreter whose files are all owned by 'root'. Those will be
> distributed with .pyc files for performance, but you don't want to
> deprive the users of help() and anything else that needs docstrings
> etc. So... are the docstrings lazily loaded or eagerly loaded?
What relevance is it that they're owned by root?
> If eagerly, you've doubled the number of file-open calls to initialize
> the interpreter.
I do not understand why you think this is even an option. Has Serhiy
said something that I missed that makes this seem to be on the table?
That's not a rhetorical question -- I may have missed something. But I'm
sure he understands that doubling or quadrupling the number of file
operations during startup is not an optimization.
> (Or quadrupled, if you need annotations and line
> numbers and they're all separate.) If lazily, things are a lot more
> complicated than the original description suggested, and there'd need
> to be some semantic changes here.
What semantic change do you expect?
There's an implementation change, of course, but that's Serhiy's problem
to deal with and I'm sure that he has considered that. There should be
no semantic change. When you access obj.__doc__, then and only then are
the compiled docstrings for that module read from the disk.
I don't know the current implementation of .pyc files, but I like
Antoine's suggestion of laying it out in four separate areas (plus
header), each one marshalled:
    code
    docstrings
    annotations
    line numbers
Aside from code, which is mandatory, the three other sections could be
None to represent "not available", as is the case when you pass -OO to
the interpreter, or they could be some other sentinel that means "load
lazily from the appropriate file", or they could be the marshalled data
directly in place to support byte-code only libraries.
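To make the layout concrete, here is a rough sketch of four independently
marshalled areas, with code first so it can be read without touching the
rest. This is only an illustration: the function names and the LOAD_LAZILY
sentinel are made up, there is no header, and nothing here reflects the
actual .pyc format.

```python
import io
import marshal

# Hypothetical sentinels, invented for this sketch.
NOT_AVAILABLE = None         # section stripped, e.g. as with -OO
LOAD_LAZILY = "<external>"   # stand-in for "load from the appropriate file"


def write_sections(code, docstrings, annotations, line_numbers):
    """Serialise the four areas one after another, code first."""
    buf = io.BytesIO()
    for section in (code, docstrings, annotations, line_numbers):
        marshal.dump(section, buf)
    return buf.getvalue()


def read_code_only(data):
    """Unmarshal just the first area; trailing bytes are ignored."""
    return marshal.loads(data)


def read_all(data):
    """Unmarshal all four areas in order."""
    buf = io.BytesIO(data)
    return [marshal.load(buf) for _ in range(4)]


code = compile("x = 40 + 2", "<demo>", "exec")
data = write_sections(code, {"f": "Doc for f."}, LOAD_LAZILY, NOT_AVAILABLE)
```

Here the docstrings are stored in place, the annotations slot says "look
elsewhere", and the line numbers are simply absent.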
As for the in-memory data structures of objects themselves, I imagine
something like the __doc__ and __annotations__ slots pointing to a table
of strings, which is not initialised until you attempt to read from the
table. Or something -- don't pay too much attention to my wild guesses.
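In pure Python, that guess might look roughly like the sketch below. The
LazyTable and DocSlot names are invented, the loader is a stand-in for
reading the docstring area from disk, and the attribute is called "doc"
rather than __doc__ to keep the example simple. A real implementation would
live in C and could look quite different.

```python
class LazyTable:
    """A table of strings that is not initialised until first read."""

    def __init__(self, loader):
        self._loader = loader   # hypothetical: would read the docstring area
        self._table = None
        self.loads = 0          # counts loader calls, for demonstration only

    def __getitem__(self, key):
        if self._table is None:
            # First access: only now is the table actually loaded.
            self._table = self._loader()
            self.loads += 1
        return self._table.get(key)


class DocSlot:
    """Descriptor standing in for a slot that points into a LazyTable."""

    def __init__(self, table, key):
        self._table = table
        self._key = key

    def __get__(self, obj, objtype=None):
        return self._table[self._key]


table = LazyTable(lambda: {"spam": "Return spam."})


class Spam:
    doc = DocSlot(table, "spam")
```

Defining Spam does not call the loader; the first read of Spam.doc does,
and later reads reuse the already loaded table.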
The bottom line is, is there some reason *aside from performance* to
avoid this? Because if the performance is worse, I'm sure Serhiy will be
the first to dump this idea.
--
Steve