[Python-ideas] Move optional data out of pyc files

Petr Viktorin encukou at gmail.com
Wed Apr 11 04:28:52 EDT 2018



On 04/11/18 08:06, Steven D'Aprano wrote:
> On Wed, Apr 11, 2018 at 02:21:17PM +1000, Chris Angelico wrote:
> 
> [...]
>>> Yes, it will double the number of files. Actually quadruple it, if the
>>> annotations and line numbers are in separate files too. But if most of
>>> those extra files never need to be opened, then there's no cost to them.
>>> And whatever extra cost there is, is amortized over the lifetime of the
>>> interpreter.
>>
>> Yes, if they are actually not needed. My question was about whether
>> that is truly valid.
> 
> We're never really going to know the effect on performance without
> implementing and benchmarking the code. It might turn out that, to our
> surprise, three quarters of the std lib relies on loading docstrings
> during startup. But I doubt it.
> 
> 
>> Consider a very common use-case: an OS-provided
>> Python interpreter whose files are all owned by 'root'. Those will be
>> distributed with .pyc files for performance, but you don't want to
>> deprive the users of help() and anything else that needs docstrings
>> etc. So... are the docstrings lazily loaded or eagerly loaded?
> 
> What relevance is it that they're owned by root?
> 
> 
>> If eagerly, you've doubled the number of file-open calls to initialize
>> the interpreter.
> 
> I do not understand why you think this is even an option. Has Serhiy
> said something that I missed that makes this seem to be on the table?
> That's not a rhetorical question -- I may have missed something. But I'm
> sure he understands that doubling or quadrupling the number of file
> operations during startup is not an optimization.
> 
> 
>> (Or quadrupled, if you need annotations and line
>> numbers and they're all separate.) If lazily, things are a lot more
>> complicated than the original description suggested, and there'd need
>> to be some semantic changes here.
> 
> What semantic change do you expect?
> 
> There's an implementation change, of course, but that's Serhiy's problem
> to deal with and I'm sure that he has considered that. There should be
> no semantic change. When you access obj.__doc__, then and only then are
> the compiled docstrings for that module read from the disk.
> 
> I don't know the current implementation of .pyc files, but I like
> Antoine's suggestion of laying it out in four separate areas (plus
> header), each one marshalled:
> 
>      code
>      docstrings
>      annotations
>      line numbers
> 
> Aside from code, which is mandatory, the three other sections could be
> None to represent "not available", as is the case when you pass -OO to
> the interpreter, or they could be some other sentinel that means "load
> lazily from the appropriate file", or they could be the marshalled data
> directly in place to support byte-code only libraries.
> 
> As for the in-memory data structures of objects themselves, I imagine
> something like the __doc__ and __annotations__ slots pointing to a table
> of strings, which is not initialised until you attempt to read from the
> table. Or something -- don't pay too much attention to my wild guesses.

A __doc__ sentinel could even say something like "bytes 350--420 in the 
original .py file, as UTF-8".
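
As a rough sketch of what such a sentinel might look like: the names
SourceSpan and resolve_doc below are made up for illustration and are not
anything that exists in CPython. The sketch also assumes the recorded span
covers only the text between the quotes, so it ignores escape sequences and
the indentation clean-up a real docstring would need.

```python
import os
import tempfile


class SourceSpan:
    """Hypothetical sentinel: 'bytes start--end of path, as UTF-8'."""

    def __init__(self, path, start, end):
        self.path = path
        self.start = start  # byte offset of the first byte of the docstring text
        self.end = end      # byte offset one past the last byte

    def load(self):
        # The file is opened only when the docstring is actually requested.
        with open(self.path, "rb") as f:
            f.seek(self.start)
            return f.read(self.end - self.start).decode("utf-8")


def resolve_doc(value):
    """Return a usable docstring for a stored value (hypothetical helper).

    A SourceSpan is read from disk; None ("not available") and ordinary
    strings are passed through unchanged.
    """
    if isinstance(value, SourceSpan):
        return value.load()
    return value


# Demo: write a small module to disk and record where its docstring lives.
source = 'def greet():\n    """Say hello, na\u00efvely."""\n    return "hello"\n'
data = source.encode("utf-8")
doc_bytes = "Say hello, na\u00efvely.".encode("utf-8")
start = data.index(doc_bytes)
end = start + len(doc_bytes)

fd, path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "wb") as f:
    f.write(data)

span = SourceSpan(path, start, end)
loaded = resolve_doc(span)  # the only point at which the .py file is read
os.remove(path)
print(loaded)
```

The offsets have to be byte offsets rather than character offsets, which is
why the example includes a non-ASCII character: the span is one byte longer
than the decoded string. Such a sentinel would also go stale if the .py file
were edited without recompiling, so it would presumably need the same
staleness check the .pyc header already provides.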

> 
> The bottom line is, is there some reason *aside from performance* to
> avoid this? Because if the performance is worse, I'm sure Serhiy will be
> the first to dump this idea.
> 
> 

