[Python-ideas] Move optional data out of pyc files
Steven D'Aprano
steve at pearwood.info
Wed Apr 11 21:59:26 EDT 2018
On Thu, Apr 12, 2018 at 12:09:38AM +1000, Chris Angelico wrote:
[...]
> >> Consider a very common use-case: an OS-provided
> >> Python interpreter whose files are all owned by 'root'. Those will be
> >> distributed with .pyc files for performance, but you don't want to
> >> deprive the users of help() and anything else that needs docstrings
> >> etc. So... are the docstrings lazily loaded or eagerly loaded?
> >
> > What relevance is that they're owned by root?
>
> You have to predict in advance what you'll want to have in your pyc
> files. Can't create them on the fly.
How is that different from the situation right now?
> > What semantic change do you expect?
> >
> > There's an implementation change, of course, but that's Serhiy's problem
> > to deal with and I'm sure that he has considered that. There should be
> > no semantic change. When you access obj.__doc__, then and only then are
> > the compiled docstrings for that module read from the disk.
>
> In other words, attempting to access obj.__doc__ can actually go and
> open a file. Does it need to check if the file exists as part of the
> import, or does it go back to sys.path?
That's implementation, so I don't know, but I imagine that the module
object will have a link pointing directly to the expected file on disk.
No need to search the path, you just go directly to the expected file.
Apart from handling the case when it doesn't exist, in which case the
docstring or annotations get set to None, it should be relatively
straightforward.
That link could be an explicit pathname:
/path/to/__pycache__/foo.cpython-33-doc.pyc
or it could be implicitly built when required from the "master" .pyc
file's path, since the differences are likely to be deterministic.
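To make the "implicitly built" option concrete, here is a minimal sketch. It is not Serhiy's implementation: the "-doc" suffix is taken from the example pathname above, and the helper names metadata_path and read_metadata are hypothetical. It derives the side file's path from the master .pyc path and returns None when the side file is missing.

```python
from pathlib import Path


def metadata_path(master_pyc, kind="doc"):
    """Derive the (hypothetical) side-file path from the master .pyc path."""
    master = Path(master_pyc)
    # "foo.cpython-33.pyc" -> "foo.cpython-33-doc.pyc"
    return master.with_name(f"{master.stem}-{kind}{master.suffix}")


def read_metadata(master_pyc, kind="doc"):
    """Return the side file's raw bytes, or None if the file does not exist."""
    try:
        return metadata_path(master_pyc, kind).read_bytes()
    except FileNotFoundError:
        # Missing side file: the caller would set __doc__ etc. to None.
        return None


if __name__ == "__main__":
    print(metadata_path("/path/to/__pycache__/foo.cpython-33.pyc"))
```

No search of sys.path is involved: the only file system access is the single attempt to open the derived path.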
> If the former, you're right
> back with the eager loading problem of needing to do 2-4 times as many
> stat calls;
Except that's not eager loading. When you open the file on demand, it
might never be opened at all. If it is opened, it is likely to be a long
time after interpreter startup.
> > As for the in-memory data structures of objects themselves, I imagine
> > something like the __doc__ and __annotations__ slots pointing to a table
> > of strings, which is not initialised until you attempt to read from the
> > table. Or something -- don't pay too much attention to my wild guesses.
> >
> > The bottom line is, is there some reason *aside from performance* to
> > avoid this? Because if the performance is worse, I'm sure Serhiy will be
> > the first to dump this idea.
>
> Obviously it could be turned into just a performance question, but in
> that case everything has to be preloaded
You don't need to preload things to get a performance benefit.
Preloading things that you don't need immediately and may never need at
all, like docstrings, annotations and line numbers, is inefficient.
I fear that you have completely failed to understand the (potential)
performance benefit here.
The point, or at least *a* point, of the exercise is to speed up
interpreter startup by deferring some of the work until it is needed.
When you defer work, the pluses are that it reduces startup time, and
sometimes you can avoid doing it at all; the minus is that if you do end
up needing to do it, you have to do a little bit extra.
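The trade-off can be illustrated with a small sketch of deferred loading. This is an illustration only, assuming a hypothetical LazyMetadata wrapper and a stand-in loader function; nothing here is an existing CPython API. The loader runs at most once, and only if the value is actually requested.

```python
class LazyMetadata:
    """Defer an expensive load until the value is first requested."""

    _UNSET = object()  # sentinel, so that a loaded value of None is cached too

    def __init__(self, loader):
        self._loader = loader
        self._value = self._UNSET
        self.loads = 0  # how many times the loader actually ran

    def get(self):
        if self._value is self._UNSET:
            self.loads += 1
            self._value = self._loader()
        return self._value


def fake_loader():
    """Stand-in for reading docstrings or line numbers from disk."""
    return {"__doc__": "Example docstring."}


if __name__ == "__main__":
    modules = [LazyMetadata(fake_loader) for _ in range(200)]
    for meta in modules[:20]:
        meta.get()
        meta.get()  # second access is served from the cached value
    print(sum(meta.loads for meta in modules))
```

With 200 wrappers of which only 20 are ever read, the loader runs 20 times in total, however often those 20 are read afterwards.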
So let's look at a few common scenarios:
1. You run a script. Let's say that the script ends up loading, directly
or indirectly, 200 modules, none of which need docstrings or annotations
during runtime, and the script runs to completion without needing to
display a traceback. You save loading 200 sets of docstrings,
annotations and line numbers ("metadata" for brevity) so overall the
interpreter starts up quicker and the script runs faster.
2. You run the same script, but this time it raises an exception and
displays a traceback. So now you have to load, let's say, 20 sets of
line numbers, which is a bit slower, but that doesn't happen until the
exception is raised and the traceback printed, which is already a slow
and exceptional case so who cares if it takes an extra few milliseconds?
It is still an overall win because of the 180 sets of metadata you
didn't need to load.
3. You have a long-running server application which runs for days or
weeks between restarts. Let's say it loads 1000 modules, so you get
significant savings during start up (let's say, hypothetically shaving
off 2 seconds from a 30 second start up time), but over the course of
the week it ends up eventually loading all 1000 sets of metadata. Since
that is deferred until needed, it doesn't happen all at once, but spread
out a little bit at a time.
Overall, you end up doing four times as many file system operations, but
since they're amortized over the entire week, not startup, it is still a
win.
(And remember that this extra cost only applies the first time a
module's metadata is needed. It isn't a cost you keep paying over and
over again.)
We're (hopefully!) not going to care too much if the first few times the
server needs to log a traceback, it hits the file system a few extra
times. Logging a traceback is already expensive, but it is also
exceptional, so making it a bit more expensive is nevertheless
likely to be an overall win if it makes startup faster.
The cost/benefit accounting here is: we care far more about saving 2
seconds out of the 30 second startup (roughly a 7% saving) than we care
about spending an extra 8 seconds spread over a week (about 0.001% cost).
4. You're running the interactive interpreter. You probably aren't even
going to notice the fact that it starts up a millisecond faster, or even
10 ms, but on the other hand you aren't going to notice either if the
first time you call help(obj) it makes an extra four file system
accesses and takes an extra few milliseconds. Likewise for tracebacks,
you're not going to notice or care if it takes 350ms instead of 300ms to
print a traceback. (Or however long it actually takes -- my care factor
is too low to even try to measure it.)
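The arithmetic behind scenarios 2 and 3 can be checked with a few lines. The inputs are the hypothetical figures used in the scenarios (200 modules, 20 tracebacks, 2 seconds saved out of 30, 8 extra seconds over a week), not measurements.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Scenario 2: 200 modules loaded, 20 needed line numbers for the traceback.
sets_skipped = 200 - 20

# Scenario 3: startup saving versus the deferred cost spread over a week.
startup_saving = percent(2, 30)
deferred_cost = percent(8, SECONDS_PER_WEEK)

if __name__ == "__main__":
    print(f"metadata sets never loaded: {sets_skipped}")
    print(f"startup saving: {startup_saving:.1f}%")
    print(f"deferred cost over a week: {deferred_cost:.4f}%")
```

Two seconds out of thirty is a little under 7% of startup time, while eight seconds is on the order of a thousandth of a percent of a week.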
These are, in my opinion, typical scenarios. If you're in an atypical
scenario, say all your modules are loaded over a network running over a
piece of string stuck between two tin cans *wink*, then you probably
will feel a lot more pain, but honestly that's not our problem. We're
not obliged to optimize Python for running on broken networks.
And besides, since we have to support byte-code only modules, and we
want them to be a single .pyc file not four, people with atypical
scenarios, or who make different cost/benefit tradeoffs, can always opt
in to the single .pyc mode.
[...]
> As a simple example, upgrading your Python installation while you have
> a Python script running can give you this effect already.
Right -- so we're not adding any failure modes that don't already exist.
It is *already* a bad idea to upgrade your Python installation, or even
modify modules, while Python is running, since the source code may get
out of sync with the cached line numbers and the tracebacks will become
inaccurate. This is especially a problem when running in the interactive
interpreter while editing the file you are running.
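This existing failure mode is easy to reproduce with current Python. The sketch below (the module name demo_mod and the function stale_traceback_line are made up for the demonstration) imports a module from a temporary file, rewrites the file with an extra line inserted, and then inspects the traceback: the line number still comes from the code object in memory, while the source text is read from the changed file.

```python
import importlib.util
import os
import sys
import tempfile
import traceback


def stale_traceback_line():
    """Return (lineno, source_text) reported after the file changed on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_mod.py")
        with open(path, "w") as f:
            f.write("def boom():\n"
                    "    raise ValueError('oops')\n")

        # Import the module: its code object records line 2 for the raise.
        spec = importlib.util.spec_from_file_location("demo_mod", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

        # "Upgrade" the file while the old code is still loaded in memory.
        with open(path, "w") as f:
            f.write("def boom():\n"
                    "    pass  # inserted later\n"
                    "    raise ValueError('oops')\n")

        try:
            module.boom()
        except ValueError:
            # The source text is looked up in the file as it is on disk now.
            frame = traceback.extract_tb(sys.exc_info()[2])[-1]
            return frame.lineno, frame.line


if __name__ == "__main__":
    print(stale_traceback_line())
```

The reported line number is 2, but the text shown for it is the inserted line rather than the raise statement that actually failed.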
> if docstrings are looked up
> separately - and especially if lnotab is too - you could happily
> import and use something (say, in a web server), then run updates, and
> then an exception requires you to look up a line number. Oops, a few
> lines got inserted into that file, and now all the line numbers are
> straight-up wrong. That's a definite behavioural change.
Indeed, but that's no different from what happens now when the same line
number might point to a different line of source code.
> Maybe it's
> one that's considered acceptable, but it definitely is a change.
I don't think it is a change, and I think it is acceptable. I think the
solution is, don't upgrade your modules while you're still running them!
--
Steve