[Python-ideas] Move optional data out of pyc files

Steven D'Aprano steve at pearwood.info
Wed Apr 11 21:59:26 EDT 2018


On Thu, Apr 12, 2018 at 12:09:38AM +1000, Chris Angelico wrote:

[...]
> >> Consider a very common use-case: an OS-provided
> >> Python interpreter whose files are all owned by 'root'. Those will be
> >> distributed with .pyc files for performance, but you don't want to
> >> deprive the users of help() and anything else that needs docstrings
> >> etc. So... are the docstrings lazily loaded or eagerly loaded?
> >
> > What relevance is that they're owned by root?
> 
> You have to predict in advance what you'll want to have in your pyc
> files. Can't create them on the fly.

How is that different from the situation right now?


> > What semantic change do you expect?
> >
> > There's an implementation change, of course, but that's Serhiy's problem
> > to deal with and I'm sure that he has considered that. There should be
> > no semantic change. When you access obj.__doc__, then and only then are
> > the compiled docstrings for that module read from the disk.
> 
> In other words, attempting to access obj.__doc__ can actually go and
> open a file. Does it need to check if the file exists as part of the
> import, or does it go back to sys.path? 

That's implementation, so I don't know, but I imagine that the module 
object will have a link pointing directly to the expected file on disk. 
No need to search the path: you just go directly to the expected file. 
Apart from handling the case where the file doesn't exist, in which case 
the docstrings or annotations get set to None, it should be relatively 
straightforward.

That link could be an explicit pathname:

    /path/to/__pycache__/foo.cpython-33-doc.pyc

or it could be implicitly built when required from the "master" .pyc 
file's path, since the differences are likely to be deterministic.
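
Purely as a sketch of the "implicitly built" option (the "-doc" suffix 
is just the hypothetical naming from the example above, not anything 
Serhiy has specified), it could be as simple as:

    import importlib.util

    def doc_pyc_path(source_path):
        # e.g. /path/to/foo.py -> /path/to/__pycache__/foo.cpython-XY.pyc
        pyc = importlib.util.cache_from_source(source_path)
        # ... and then -> /path/to/__pycache__/foo.cpython-XY-doc.pyc
        stem, ext = pyc.rsplit(".", 1)
        return "{}-doc.{}".format(stem, ext)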


> If the former, you're right
> back with the eager loading problem of needing to do 2-4 times as many
> stat calls;

Except that's not eager loading. If the file is only opened on demand, 
it might never be opened at all; and if it is opened, that is likely to 
be a long time after interpreter startup.
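
To put that in concrete (but entirely made-up) terms, "open on demand" 
could look roughly like this, where nothing touches the disk until the 
first docstring is actually requested:

    import marshal

    class LazyDocTable:
        """Illustrative only: a per-module table of docstrings, read lazily."""

        def __init__(self, path):
            self._path = path    # e.g. /path/to/__pycache__/foo.cpython-33-doc.pyc
            self._table = None   # nothing has touched the disk yet

        def lookup(self, index):
            if self._table is None:
                try:
                    with open(self._path, "rb") as f:
                        # assume the sidecar holds a marshalled dict
                        # mapping slot index -> docstring
                        self._table = marshal.load(f)
                except OSError:
                    # sidecar missing or unreadable: docstrings fall back to None
                    self._table = {}
            return self._table.get(index)

Whether the real thing uses marshal, one file per kind of metadata, or 
something else entirely is, again, Serhiy's call.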


> > As for the in-memory data structures of objects themselves, I imagine
> > something like the __doc__ and __annotation__ slots pointing to a table
> > of strings, which is not initialised until you attempt to read from the
> > table. Or something -- don't pay too much attention to my wild guesses.
> >
> > The bottom line is, is there some reason *aside from performance* to
> > avoid this? Because if the performance is worse, I'm sure Serhiy will be
> > the first to dump this idea.
> 
> Obviously it could be turned into just a performance question, but in
> that case everything has to be preloaded

You don't need to preload things to get a performance benefit. 
Preloading things that you don't need immediately and may never need at 
all, like docstrings, annotations and line numbers, is inefficient.

I fear that you have completely failed to understand the (potential) 
performance benefit here.

The point, or at least *a* point, of the exercise is to speed up 
interpreter startup by deferring some of the work until it is needed. 
When you defer work, the pluses are that it reduces startup time, and 
sometimes you can avoid doing the work at all; the minus is that if you 
do end up needing the work, doing it later costs a little bit extra.

So let's look at a few common scenarios:


1. You run a script. Let's say that the script ends up loading, directly 
or indirectly, 200 modules, none of which need docstrings or annotations 
during runtime, and the script runs to completion without needing to 
display a traceback. You save loading 200 sets of docstrings, 
annotations and line numbers ("metadata" for brevity) so overall the 
interpreter starts up quicker and the script runs faster.


2. You run the same script, but this time it raises an exception and 
displays a traceback. So now you have to load, let's say, 20 sets of 
line numbers, which is a bit slower, but that doesn't happen until the 
exception is raised and the traceback printed, which is already a slow 
and exceptional case so who cares if it takes an extra few milliseconds? 
It is still an overall win because of the 180 sets of metadata you 
didn't need to load.


3. You have a long-running server application which runs for days or 
weeks between restarts. Let's say it loads 1000 modules, so you get 
significant savings during startup (let's say, hypothetically shaving 
off 2 seconds from a 30 second startup time), but over the course of 
the week it ends up eventually loading all 1000 sets of metadata. Since 
that loading is deferred until needed, it doesn't happen all at once, 
but is spread out a little at a time.

Overall, you end up doing four times as many file system operations, but 
since they're amortized over the entire week, not startup, it is still a 
win.

(And remember that this extra cost only applies the first time a 
module's metadata is needed. It isn't a cost you keep paying over and 
over again.)

We're (hopefully!) not going to care too much if the first few times the 
server needs to log a traceback, it hits the file system a few extra 
times. Logging a traceback is already expensive, but it is also an 
exceptional case, so making it a bit more expensive is nevertheless 
likely to be an overall win if it makes startup faster.

The cost/benefit accounting here is: we care far more about saving 2 
seconds out of the 30 second startup (a roughly 7% saving) than we care 
about spending an extra 8 seconds spread over a week (roughly a 0.001% 
cost).
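
For the record, the arithmetic with those made-up numbers:

    # back-of-the-envelope check, using the hypothetical figures above
    startup_saving = 2 / 30               # ~0.067, i.e. roughly 7% of startup
    seconds_per_week = 7 * 24 * 60 * 60   # 604800
    deferred_cost = 8 / seconds_per_week  # ~0.0000132, i.e. roughly 0.001%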


4. You're running the interactive interpreter. You probably aren't even 
going to notice the fact that it starts up a millisecond faster, or even 
10 ms, but on the other hand you aren't going to notice either if the 
first time you call help(obj) it makes an extra four file system 
accesses and takes an extra few milliseconds. Likewise for tracebacks, 
you're not going to notice or care if it takes 350ms instead of 300ms to 
print a traceback. (Or however long it actually takes -- my care factor 
is too low to even try to measure it.)


These are, in my opinion, typical scenarios. If you're in an atypical 
scenario, say all your modules are loaded over a network running over a 
piece of string stuck between two tin cans *wink*, then you probably 
will feel a lot more pain, but honestly that's not our problem. We're 
not obliged to optimize Python for running on broken networks.

And besides, since we have to support byte-code only modules, and we 
want them to be a single .pyc file rather than four, people with 
atypical scenarios, or who make different cost/benefit tradeoffs, can 
always opt in to the single .pyc mode.


[...]
> As a simple example, upgrading your Python installation while you have
> a Python script running can give you this effect already.

Right -- so we're not adding any failure modes that don't already exist.

It is *already* a bad idea to upgrade your Python installation, or even 
modify modules, while Python is running, since the source code may get 
out of sync with the cached line numbers and the tracebacks will become 
inaccurate. This is especially a problem when running in the interactive 
interpreter while editing the file you are running.


> if docstrings are looked up
> separately - and especially if lnotab is too - you could happily
> import and use something (say, in a web server), then run updates, and
> then an exception requires you to look up a line number. Oops, a few
> lines got inserted into that file, and now all the line numbers are
> straight-up wrong. That's a definite behavioural change.

Indeed, but that's no different from what happens now when the same line 
number might point to a different line of source code.
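
(That's because the traceback machinery re-reads the source file as it 
exists on disk *now*, something like this, with a made-up path and line 
number:

    import linecache

    # traceback formatting looks up the displayed source text by
    # (filename, lineno) in the file as it currently exists on disk
    source_line = linecache.getline("/path/to/module.py", 42)

so if the file has been edited since it was imported, line 42 is simply 
whatever happens to be on line 42 now.)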


> Maybe it's
> one that's considered acceptable, but it definitely is a change.

I don't think it is a change, and I think it is acceptable. I think the 
solution is: don't upgrade your modules while you're still running them!



-- 
Steve

