Move optional data out of pyc files

Currently pyc files contain data that is useful mostly for development and is not needed in most normal cases when running a stable program. There is even an option that excludes part of this information from pyc files. This is expected to save memory, startup time, and disk space (or the time of loading from a network). I propose to move this data out of pyc files into a separate file or files; pyc files should contain only references to those external files. If the corresponding external file is absent, or a specific option suppresses them, the references are replaced with None or NULL at import time; otherwise they are loaded from the external files.

1. Docstrings. They are needed mainly for development.

2. Line numbers (lnotab). They are helpful for formatting tracebacks, for tracing, and for debugging with the debugger. Sources are helpful in such cases too. If the program doesn't contain errors ;-) and is shipped without sources, they could be removed.

3. Annotations. They are used mainly by third-party tools that statically analyze sources. They are rarely used at runtime.

Docstrings would be read from the corresponding docstring file unless -OO is supplied. This would also allow localizing docstrings: depending on locale or other settings, a different docstring file could be used. For suppressing line numbers and annotations, new options could be added.
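For concreteness, the three kinds of data listed above are all attached to ordinary code and function objects in today's CPython; a quick look (purely illustrative, nothing here is specific to the proposal):

```python
def greet(name: str) -> str:
    """Return a friendly greeting."""
    return "Hello, " + name

code = greet.__code__
# 1. The docstring is stored as the first constant of the code object.
print(code.co_consts[0])
# 2. The line-number table is a bytes field: co_lnotab historically,
#    co_linetable on newer interpreters.
lnotab = getattr(code, "co_linetable", None) or code.co_lnotab
print(len(lnotab), "bytes of line-number data")
# 3. Annotations hang off the function object itself.
print(greet.__annotations__)
```

All three would become external references under the proposal, resolved to None when the side files are absent or suppressed.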

On Tue, 10 Apr 2018 19:14:58 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote:
Indeed, it may be nice to find a solution to ship them separately.
What is the weight of lnotab arrays? While docstrings can be large, I'm somewhat skeptical that removing lnotab arrays would bring a significant improvement. It would be nice to have more data about this.
3. Annotations. They are used mainly by third party tools that statically analyze sources. They are rarely used at runtime.
Even less used than docstrings, probably. Regards, Antoine.

On Tue, Apr 10, 2018 at 12:51 PM Eric V. Smith <eric@trueblade.com> wrote:
Yep. Everything accessible in any way at runtime is used by something at runtime. It's a public API; we can't just get rid of it. Several libraries rely on docstrings being available (an additional case in point beyond the already-linked CLI tool: ply <http://www.dabeaz.com/ply/ply.html>). Most of the world never appears to use -O and -OO. If they do, they either don't use these libraries or jump through special hoops to prevent pyo compilation of any sources that need them (unlikely). -gps

On Tue, Apr 10, 2018 at 9:50 PM, Eric V. Smith <eric@trueblade.com> wrote:
Astropy uses annotations at runtime for optional unit checking on arguments that take dimensionful quantities: http://docs.astropy.org/en/stable/api/astropy.units.quantity_input.html#astr...

10.04.18 19:24, Antoine Pitrou пише:
Maybe it is low. I just mentioned three kinds of data in pyc files that can be optional. If we move out docstrings and annotations, why not move lnotabs? It would be easy if we have already implemented the infrastructure for the other two.
And since there is a way of providing annotations in human-readable format separately from the source code, it looks natural to provide a way to compile them into separate files.

On Wed, Apr 11, 2018 at 2:14 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
A deployed Python distribution generally has .pyc files for all of the standard library. I don't think people want to lose the ability to call help(), and unless I'm misunderstanding, that requires docstrings. So this will mean twice as many files and twice as many file-open calls to import from the standard library. What will be the impact on startup time? ChrisA

There are libraries out there like this: https://docopt.readthedocs.io/en/0.2.0/ which use docstrings for runtime info. Today we already have -OO, which allows us to create docstring-less bytecode files in case we have, after careful consideration, established that it is safe to do so. I think the current way (-OO) to avoid docstring loading is the correct one: it pushes the responsibility onto whoever did the packaging to decide whether -OO is appropriate. The ability to remove the docstrings after bytecode generation would be kinda nice (similar to the Unix "strip" command), but given how fast bytecode compilation is, frankly I don't think it is very important. Stephan

2018-04-10 19:54 GMT+02:00 Zachary Ware <zachary.ware+pydev@gmail.com>:

On Tue, 10 Apr 2018 11:13:01 -0700 Ethan Furman <ethan@stoneleaf.us> wrote:
"python -O" and "python -OO" *do* generate different pyc files. If you want to trim docstrings with those options, you need to regenerate pyc files for all your dependencies (including third-party libraries and standard library modules). Serhiy's proposal allows "-O" and "-OO" to work without needing a custom bytecode generation step. Regards, Antoine.

On 10/04/2018 18:54, Zachary Ware wrote:
Personally I quite like the idea of having the docstrings, and possibly other optional components, in a zipped section after a marker for the end of the operational code. The loader could stop reading at that point (reducing load time and memory impact), and only load and unzip on demand. Zipping the docstrings should give a significant reduction in file sizes, but it is worth remembering a couple of things:

- Python is already one of the most compact languages for what it can do - I have had experts demanding to know where the rest of the program is hidden, and how it is being downloaded, when they noticed the size of the installed code versus the functionality provided.
- File size <> disk space consumed - on most file systems each file typically occupies 1 + (file_size // allocation_size) clusters of the drive, and with increasing disk sizes the allocation_size is generally increasing - both of my NTFS drives currently have 4,096 byte allocation sizes, but I am offered up to 2 MB allocation sizes. Splitting a 10,052 byte .pyc file (picking a random example from my drive) into 5,052 and 5,000 byte files will change the disk space occupied from 3*4,096 to 4*4,096, plus the extra directory entry.
- Where absolute file size is critical (such as embedded systems), you can always use the -O & -OO flags.

-- Steve (Gadget) Barnes
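The cluster arithmetic above is easy to sanity-check with the same 1 + (file_size // allocation_size) rule quoted in the post (a simplification which, like the original, slightly overcounts files whose size is an exact multiple of the allocation size):

```python
def clusters(file_size, allocation_size=4096):
    # The rule from the post: 1 + (file_size // allocation_size)
    return 1 + file_size // allocation_size

whole = clusters(10052)                   # the unsplit .pyc
split = clusters(5052) + clusters(5000)   # after splitting it in two
print(whole, "->", split, "clusters")     # 3 -> 4, i.e. 12 KiB -> 16 KiB
```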

On Wed, Apr 11, 2018 at 03:38:08AM +1000, Chris Angelico wrote:
I shouldn't think that the number of files on disk is very important, now that they're hidden away in the __pycache__ directory where they can be ignored by humans. Even venerable old FAT32 has a limit of 65,534 files in a single folder, and 268,435,437 on the entire volume. So unless the std lib expands to 16000+ modules, the number of files in the __pycache__ directory ought to be well below that limit. I think even MicroPython ought to be okay with that. (But it would be nice to find out for sure: does it support file systems with *really* tiny limits?)

The entire __pycache__ directory is supposed to be a black box except under unusual circumstances, so it doesn't matter (at least not to me) if we have:

    __pycache__/spam.cpython-38.pyc

alone, or (say):

    __pycache__/spam.cpython-38.pyc
    __pycache__/spam.cpython-38-doc.pyc
    __pycache__/spam.cpython-38-lno.pyc
    __pycache__/spam.cpython-38-ann.pyc

And if the external references are loaded lazily, on need, rather than eagerly, this could save startup time, which I think is the intention. The docstrings would still be available, just not loaded until the first time you try to use them.

However, Python supports byte-code only distribution, using .pyc files external to __pycache__. In that case, it would be annoying and inconvenient to distribute four top-level files, so I think that the use of external references has to be optional, and there has to be a way either to compile to a single .pyc file containing all four parts, or an external tool that can take the existing four files and merge them. -- Steve

On Wed, Apr 11, 2018 at 10:03 AM, Steven D'Aprano <steve@pearwood.info> wrote:
File system limits aren't usually an issue; as you say, even FAT32 can store a metric ton of files in a single directory. I'm more interested in how long it takes to open a file, and whether doubling that time will have a measurable impact on Python startup time. Part of that cost can be reduced by using openat(), on platforms that support it, but even with a directory handle, there's still a definite non-zero cost to opening and reading an additional file. ChrisA
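The cost of that extra open is measurable without touching the interpreter; a rough micro-benchmark sketch (file names and sizes here are invented, and absolute numbers will vary wildly by OS and file system):

```python
import os
import tempfile
import timeit

with tempfile.TemporaryDirectory() as d:
    pyc = os.path.join(d, "spam.cpython-38.pyc")
    doc = os.path.join(d, "spam.cpython-38-doc.pyc")
    for path in (pyc, doc):
        with open(path, "wb") as f:
            f.write(b"\x00" * 4096)   # stand-in for marshalled data

    def open_one():
        with open(pyc, "rb") as f:
            f.read()

    def open_both():   # what an eager two-file import would have to do
        open_one()
        with open(doc, "rb") as f:
            f.read()

    t1 = timeit.timeit(open_one, number=1000)
    t2 = timeit.timeit(open_both, number=1000)
    print(f"one file: {t1:.4f}s  two files: {t2:.4f}s per 1000 imports")
```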

On Wed, Apr 11, 2018 at 10:08:58AM +1000, Chris Angelico wrote:
Yes, it will double the number of files. Actually quadruple it, if the annotations and line numbers are in separate files too. But if most of those extra files never need to be opened, then there's no cost to them. And whatever extra cost there is, is amortized over the lifetime of the interpreter.

The expectation here is that this could reduce startup time, since the files which are read are smaller, and less data needs to be read (or traverse the network) up front; the rest can be deferred until it is actually needed. Serhiy is experienced enough that I think we should assume he's not going to push this optimization into production unless it actually does reduce startup time. He has proven himself enough that we should assume competence rather than incompetence :-)

Here is the proposal as I understand it:

- by default, change .pyc files to store annotations, docstrings and line numbers as references to external files which will be lazily loaded on-need;
- single-file .pyc files must still be supported, but this won't be the default and could rely on an external "merge" tool;
- objects that rely on docstrings or annotations, such as dataclasses, may experience a (hopefully very small) increase in import time, since they may not be able to defer loading the extra files;
- but in general, most modules should (we expect) see a decrease in load time;
- which will (we hope) reduce startup time;
- libraries which make eager use of docstrings and annotations might even ship with the single-file .pyc instead (the library installer can look after that aspect), and so avoid any extra cost.

Naturally pushing this into production will require benchmarks that prove it actually does improve startup time. I believe that Serhiy's reason for asking is to determine whether it is worth his while to experiment on this. There's no point in implementing these changes and benchmarking them if there's no chance of their being accepted.

So on the assumptions that:

- benchmarking does demonstrate a non-trivial speedup of interpreter startup;
- single-file .pyc files are still supported, for the use of byte-code only libraries;
- and modules which are particularly badly impacted by this change are able to opt out and use a single .pyc file;

I see no reason not to support this idea if Serhiy (or someone else) is willing to put in the work. -- Steve

On Wed, Apr 11, 2018 at 1:02 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Yes, if they are actually not needed. My question was about whether that is truly valid. Consider a very common use-case: an OS-provided Python interpreter whose files are all owned by 'root'. Those will be distributed with .pyc files for performance, but you don't want to deprive the users of help() and anything else that needs docstrings etc. So... are the docstrings lazily loaded or eagerly loaded? If eagerly, you've doubled the number of file-open calls to initialize the interpreter. (Or quadrupled, if you need annotations and line numbers and they're all separate.) If lazily, things are a lot more complicated than the original description suggested, and there'd need to be some semantic changes here.
Oh, I'm definitely assuming that he knows what he's doing :-) Doesn't mean I can't ask the question though. ChrisA

On Wed, Apr 11, 2018 at 02:21:17PM +1000, Chris Angelico wrote: [...]
We're never really going to know the effect on performance without implementing and benchmarking the code. It might turn out that, to our surprise, three quarters of the std lib relies on loading docstrings during startup. But I doubt it.
What relevance is that they're owned by root?
If eagerly, you've doubled the number of file-open calls to initialize the interpreter.
I do not understand why you think this is even an option. Has Serhiy said something that I missed that makes this seem to be on the table? That's not a rhetorical question -- I may have missed something. But I'm sure he understands that doubling or quadrupling the number of file operations during startup is not an optimization.
What semantic change do you expect? There's an implementation change, of course, but that's Serhiy's problem to deal with, and I'm sure that he has considered that. There should be no semantic change: when you access obj.__doc__, then and only then are the compiled docstrings for that module read from disk.

I don't know the current implementation of .pyc files, but I like Antoine's suggestion of laying them out in four separate areas (plus header), each one marshalled:

- code
- docstrings
- annotations
- line numbers

Aside from code, which is mandatory, the three other sections could be None to represent "not available" (as is the case when you pass -OO to the interpreter), or they could be some other sentinel that means "load lazily from the appropriate file", or they could be the marshalled data directly in place, to support byte-code only libraries.

As for the in-memory data structures of the objects themselves, I imagine something like the __doc__ and __annotations__ slots pointing to a table of strings, which is not initialised until you attempt to read from it. Or something -- don't pay too much attention to my wild guesses.

The bottom line is: is there some reason *aside from performance* to avoid this? Because if the performance is worse, I'm sure Serhiy will be the first to dump this idea. -- Steve
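The "read on first access" idea can be sketched as a descriptor. Everything here (the side-file layout, the names, the list-of-strings format) is a guess for illustration, not an actual CPython design:

```python
import marshal

class LazyDoc:
    """Descriptor that defers loading a docstring table until first use."""

    def __init__(self, doc_path, index):
        self.doc_path = doc_path   # e.g. __pycache__/spam.cpython-38-doc.pyc
        self.index = index         # this object's slot in the table
        self._table = None

    def __get__(self, obj, objtype=None):
        if self._table is None:
            try:
                with open(self.doc_path, "rb") as f:
                    self._table = marshal.load(f)   # assumed: a list of str
            except OSError:
                self._table = []   # side file stripped: behave like -OO
        try:
            return self._table[self.index]
        except IndexError:
            return None
```

On first attribute access the table is read and cached; a missing side file simply yields None, matching the "replaced with None at import time" behaviour from Serhiy's proposal.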

On Wed, Apr 11, 2018 at 4:06 PM, Steven D'Aprano <steve@pearwood.info> wrote:
You have to predict in advance what you'll want to have in your pyc files. Can't create them on the fly.
In other words, attempting to access obj.__doc__ can actually go and open a file. Does it need to check if the file exists as part of the import, or does it go back to sys.path? If the former, you're right back with the eager loading problem of needing to do 2-4 times as many stat calls; if the latter, it's semantically different in that a change to sys.path can influence something that normally is preloaded.
Obviously it could be turned into just a performance question, but in that case everything has to be preloaded, and I doubt there's going to be any advantage. To be absolutely certain of retaining the existing semantics, there'd need to be some sort of anchoring to ensure that *this* .pyc file goes with *that* .pyc_docstrings file. Looking them up anew will mean that there's every possibility that you get the wrong file back.

As a simple example, upgrading your Python installation while you have a Python script running can give you this effect already. Just import a few modules, then change everything on disk. If you now import a module that was already imported, you get it from cache (and the unmodified version); import something that wasn't imported already, and it goes to the disk. At the granularity of modules, this is seldom a problem (I can imagine some package modules getting confused by this, but otherwise not usually); but if docstrings are looked up separately - and especially if lnotab is too - you could happily import and use something (say, in a web server), then run updates, and then an exception requires you to look up a line number. Oops, a few lines got inserted into that file, and now all the line numbers are straight-up wrong.

That's a definite behavioural change. Maybe it's one that's considered acceptable, but it definitely is a change. And if mutations to sys.path can do this, it's definitely a semantic change in Python. ChrisA

On Thu, Apr 12, 2018 at 12:09:38AM +1000, Chris Angelico wrote: [...]
How is that different from the situation right now?
That's implementation, so I don't know, but I imagine that the module object will have a link pointing directly to the expected file on disk. No need to search the path; you just go directly to the expected file. Apart from handling the case when it doesn't exist - in which case the docstring or annotations get set to None - it should be relatively straightforward. That link could be an explicit pathname:

    /path/to/__pycache__/foo.cpython-33-doc.pyc

or it could be built implicitly, when required, from the "master" .pyc file's path, since the differences are likely to be deterministic.
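That implicit derivation could be as simple as a suffix rewrite; a sketch (the "-doc"/"-lno"/"-ann" naming is Steven's earlier example, not anything specified):

```python
from pathlib import Path

def sibling(pyc_path, kind):
    """Derive e.g. 'foo.cpython-38-doc.pyc' from 'foo.cpython-38.pyc'."""
    p = Path(pyc_path)
    return p.with_name(f"{p.stem}-{kind}{p.suffix}")

for kind in ("doc", "lno", "ann"):
    print(sibling("__pycache__/foo.cpython-38.pyc", kind))
```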
Except that's not eager loading. When you open the file on demand, it might never be opened at all. If it is opened, it is likely to be a long time after interpreter startup.
You don't need to preload things to get a performance benefit. Preloading things that you don't need immediately, and may never need at all - like docstrings, annotations and line numbers - is inefficient.

I fear that you have completely failed to understand the (potential) performance benefit here. The point, or at least *a* point, of the exercise is to speed up interpreter startup by deferring some of the work until it is needed. When you defer work, the pluses are that it reduces startup time, and sometimes you can avoid doing the work at all; the minus is that if you do end up needing to do it, you pay a little bit extra. So let's look at a few common scenarios:

1. You run a script. Let's say the script ends up loading, directly or indirectly, 200 modules, none of which need docstrings or annotations at runtime, and the script runs to completion without needing to display a traceback. You save loading 200 sets of docstrings, annotations and line numbers ("metadata" for brevity), so overall the interpreter starts up quicker and the script runs faster.

2. You run the same script, but this time it raises an exception and displays a traceback. Now you have to load, let's say, 20 sets of line numbers, which is a bit slower - but that doesn't happen until the exception is raised and the traceback printed, which is already a slow and exceptional case, so who cares if it takes an extra few milliseconds? It is still an overall win because of the 180 sets of metadata you didn't need to load.

3. You have a long-running server application which runs for days or weeks between restarts. Let's say it loads 1000 modules, so you get significant savings during startup (say, hypothetically, shaving 2 seconds off a 30 second startup), but over the course of the week it ends up eventually loading all 1000 sets of metadata. Since that is deferred until needed, it doesn't happen all at once, but is spread out a little bit at a time. Overall, you end up doing four times as many file system operations, but since they're amortized over the entire week, not startup, it is still a win. (And remember that this extra cost only applies the first time a module's metadata is needed; it isn't a cost you keep paying over and over again.) We're (hopefully!) not going to care too much if the first few times the server needs to log a traceback, it hits the file system a few extra times. Logging tracebacks is already expensive, but it is also exceptional, so making it a bit more expensive is nevertheless likely to be an overall win if it makes startup faster. The cost/benefit accounting here is: we care far more about saving 2 seconds out of the 30 second startup (a 6% saving) than we care about spending an extra 8 seconds spread over a week (a 0.001% cost).

4. You're running the interactive interpreter. You probably aren't even going to notice that it starts up a millisecond faster, or even 10 ms; but on the other hand you aren't going to notice either if the first time you call help(obj) it makes an extra four file system accesses and takes an extra few milliseconds. Likewise for tracebacks: you're not going to notice or care if it takes 350 ms instead of 300 ms to print a traceback. (Or however long it actually takes -- my care factor is too low to even try to measure it.)

These are, in my opinion, typical scenarios. If you're in an atypical scenario - say, all your modules are loaded over a network running over a piece of string stuck between two tin cans *wink* - then you probably will feel a lot more pain, but honestly that's not our problem. We're not obliged to optimize Python for running on broken networks. And besides, since we have to support byte-code only modules, and we want them to be a single .pyc file (not four), people with atypical scenarios or different cost/benefit tradeoffs can always opt in to the single .pyc mode. [...]
As a simple example, upgrading your Python installation while you have a Python script running can give you this effect already.
Right -- so we're not adding any failure modes that don't already exist. It is *already* a bad idea to upgrade your Python installation, or even modify modules, while Python is running, since the source code may get out of sync with the cached line numbers and the tracebacks will become inaccurate. This is especially a problem when running in the interactive interpreter while editing the file you are running.
Indeed, but that's no different from what happens now when the same line number might point to a different line of source code.
Maybe it's one that's considered acceptable, but it definitely is a change.
I don't think it is a change, and I think it is acceptable. I think the solution is, don't upgrade your modules while you're still running them! -- Steve

On Thu, Apr 12, 2018 at 11:59 AM, Steven D'Aprano <steve@pearwood.info> wrote:
If the files aren't owned by root (more specifically, if they're owned by you, and you can write to the pycache directory), you can do everything at runtime. Otherwise, you have to do everything at installation time.
Referencing a path name requires that each directory in it be opened. Checking to see if the file exists requires, at absolute best, one more stat call, and that's assuming you have an open handle to the directory.
I have no idea what you mean here. Eager loading != opening the file on demand. Eager statting != opening on demand. If you're not going to hold open handles to heaps of directories, you have to reference everything by path name.
Right, and if you DON'T preload everything, you have a potential semantic difference. Which is exactly what you were asking me, and I was answering.
Does this loading happen when the exception is constructed or when it's printed? How much can you do with an exception without triggering the loading of metadata? Is it now possible for the mere formatting of a traceback to fail because of disk/network errors?
People DO run Python over networks, though, and people DO upgrade their Python installations.
Do you terminate every single Python process on your system before you upgrade Python? Let's say you're running a server on Red Hat Enterprise Linux or Debian Stable, and you go to apply all the latest security updates. Is that best done by shutting down every single application, THEN applying all updates, and only when that's all done, starting everything up? Or do you update everything on the disk, then pick one process at a time and signal it to restart?

I don't know for sure about RHEL, but I do know that Debian's package management system involves a lot of Python. So it'd be a bit tricky to build your updater such that no Python is running during updates - you'd have to deploy a brand-new Python tree somewhere to use for installation, or something. And if you have any tiny little wrapper scripts written in Python, they could easily still be running across an update, even if the rest of the app is written in C.

So, no. You should NOT have to take a blanket rule of "don't update while it's running". Instead, what you have is: "Binaries can safely be unlinked, and Python modules only get loaded when you import them".
Yes, this is true; but at least the mapping from byte code to line number is trustworthy. Worst case, you look at the traceback, and then interpret it based on an older copy of the .py file. If lnotab is loaded lazily, you don't even have that. Something's going to have to try to figure out what the mapping is.
If you need a solution to it, then it IS a change. Doesn't mean it can't be done, but it definitely is a change. (Look at the PEP 572 changes to list comprehensions at class scope. Nobody's denying that the semantics are changing; but normal usage won't ever witness the changes.) I don't think this is purely a performance question. ChrisA

On 04/11/18 06:21, Chris Angelico wrote:
Currently in Fedora, we ship *both* optimized and non-optimized pycs to make sure both -O and non -O runs work nicely without root privileges. So splitting the docstrings into a separate file would be, for us, a benefit in terms of file size.

On 4/11/2018 4:26 AM, Petr Viktorin wrote:
Currently, the Windows installer has an option to pre-compile stdlib modules. (At least it does if one does an all-users installation.) If one selects this, it creates normal, -O, and -OO versions of each. Since, like most people, I never run with -O or -OO, replacing this redundancy with one segmented file or two non-redundant files might be a win for most people. -- Terry Jan Reedy

On Tue, Apr 10, 2018 at 5:03 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Our product uses the doc strings for auto-generated help, so we need to keep those. We also allow users to write plugins and scripts, so getting valid feedback in tracebacks is essential for our support people; we'll keep the lno files, too. Annotations can probably go.

Looking at one of our little pyc files, I see:

    -rwx------+ 1 efahlgren admins 9252 Apr 10 17:25 ./lm/lib/config.pyc*

Since disk blocks are typically 4096 bytes, that's really a 12k file. Let's say it's 8k of byte code, 1k of doc, a bit of lno. So the proposed layout would give:

    config.pyc     -> 8k
    config-doc.pyc -> 4k
    config-lno.pyc -> 4k

So now I've increased disk usage by 25% (yeah yeah, I know, I picked that small file on purpose to illustrate the point, but it's not unusual). These files are often opened over a network, at least for user plugins. This can take a really, really long time on some of our poorly connected machines, like 1-2 seconds per file (no kidding, it's horrible). Now instead of opening just one file in 1-2 seconds, we have increased the time by 300%, just to do the stat+open, probably another stat to make sure there's no "ann" file lying about. Ouch. -1 from me.

2018-04-11 2:03 GMT+02:00 Steven D'Aprano <steve@pearwood.info>: [snip]
[snip] Hi all,

Just for information for everyone: I was a VMS system manager more than a decade ago, and I know that Win NT (at least the core) was developed by a former VMS engineer; NTFS was created on the basis of the Files-11 (Files-11B) file system. In both file systems the directory is a tree (in Files-11 it is a B-tree; maybe in NTFS it is a different kind of tree, but a tree), holding the files ordered alphabetically. And if there are "too many" files, then accessing files becomes slower (check for example the windows\system32 folder). Of course it doesn't matter whether there are some hundreds or 1-2 thousand files, but too many does matter.

I did a little measurement (intentionally not using functions, so as not to distort the result):

    import os
    import time

    try:
        os.mkdir('tmp_thousands_of_files')
    except:
        pass

    name1 = 10001
    start = time.time()
    file_name = 'tmp_thousands_of_files/' + str(name1)
    f = open(file_name, 'w')
    f.write('aaa')
    f.close()
    stop = time.time()
    file_time = stop - start
    print(f'one file time {file_time} \n {start} \n {stop}')

    for i in range(10002, 20000):
        file_name = 'tmp_thousands_of_files/' + str(i)
        f = open(file_name, 'w')
        f.write('aaa')
        f.close()

    name2 = 10000
    start = time.time()
    file_name = 'tmp_thousands_of_files/' + str(name2)
    f = open(file_name, 'w')
    f.write('aaa')
    f.close()
    stop = time.time()
    file_time = stop - start
    print(f'after 10k, name before {file_time} \n {start} \n {stop}')

    name3 = 20010
    start = time.time()
    file_name = 'tmp_thousands_of_files/' + str(name3)
    f = open(file_name, 'w')
    f.write('aaa')
    f.close()
    stop = time.time()
    file_time = stop - start
    print(f'after 10k, name after {file_time} \n {start} \n {stop}')

Result (used: Python 3.6.1, Windows 8.1, SSD drive):

    c:\>python several_files_in_one_folder.py
    one file time 0.0
     1523476699.5144918
     1523476699.5144918
    after 10k, name before 0.015625953674316406
     1523476714.622918
     1523476714.6385438
    after 10k, name after 0.0
     1523476714.6385438
     1523476714.6385438

As you can see, an insertion into the beginning of the tree is much slower than adding to the end. (Yes, I know list insertion is slow as well, but I saw a VMS directory with 50k files, and the dir command gave 5-10 files, then waited some seconds before the next 5-10 files ... ;-) )

BR, George

10.04.18 20:38, Chris Angelico пише:
Yes, this will mean more syscalls when importing with docstrings. But startup time doesn't matter for an interactive shell, which is where you call help(). It was expected that programs which need to gain the benefit of separating optional components would run without loading them (as with option -OO). The overhead can be reduced by packing multiple files into a single archive.

Finally, loading docstrings and other optional components can be made lazy. This was not in my original idea, and it will significantly complicate the implementation, but in principle it is possible. It would require larger changes in the marshal format and bytecode. This could open a door for further enhancements: loading the code and building classes and other complex data (especially heavy namedtuples, enums and dataclasses) on demand. Often you need just a single attribute or function from a large module. But that is a different change, out of scope for this topic.

I'm +1 on this idea.

* The new pyc format has a code section (same as current) and a text section. The text section stores UTF-8 strings and is not loaded at import time.
* Function annotations (only when PEP 563 is used) and docstrings are stored as integers, pointing to an offset in the text section.
* When type.__doc__, PyFunction.__doc__ or PyFunction.__annotations__ is an integer, the text is loaded from the text section lazily.

PEP 563 will reduce some startup time, but __annotations__ is still a dict. Memory overhead is negligible:

    In [1]: def foo(a: int, b: int) -> int:
       ...:     return a + b
       ...:

    In [2]: import sys

    In [3]: sys.getsizeof(foo)
    Out[3]: 136

    In [4]: sys.getsizeof(foo.__annotations__)
    Out[4]: 240

When PEP 563 is used, there are no side effects while building the annotations, so the annotations can be serialized as text, like {"a":"int","b":"int","return":"int"}.

This change will require a new pyc format, and descriptors for PyFunction.__doc__, PyFunction.__annotations__ and type.__doc__.

Regards,
-- INADA Naoki <songofacandy@gmail.com>
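The text-section layout can be mocked up in a few lines; the length-prefixed format below is invented here for illustration and is not INADA's actual design:

```python
import struct

def build_text_section(strings):
    """Pack strings into one UTF-8 blob; return (blob, offsets)."""
    blob = bytearray()
    offsets = []
    for s in strings:
        data = s.encode("utf-8")
        offsets.append(len(blob))
        blob += struct.pack("<I", len(data)) + data   # length-prefixed entry
    return bytes(blob), offsets

def load_string(blob, offset):
    """Lazily decode a single entry from the text section."""
    (length,) = struct.unpack_from("<I", blob, offset)
    return blob[offset + 4 : offset + 4 + length].decode("utf-8")

blob, offsets = build_text_section(
    ["Return a+b.", '{"a":"int","b":"int","return":"int"}']
)
print(load_string(blob, offsets[0]))   # Return a+b.
```

A __doc__ descriptor holding an integer would call something like load_string only on first access, leaving the blob unread (or even unmapped) at import time.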

One implementation difficulty specifically related to annotations is that they are quite hard to find/extract from the code objects. Both docstrings and lnotab live in specific fields of the code object for their function/class/module; annotations are spread as individual constants (assuming PEP 563), which are loaded in bytecode through separate LOAD_CONST statements before creating the function object, and that can happen in the middle of the bytecode for the higher-level object (the module or class containing a function definition). So the change for achieving this will be more significant than just "add a couple of descriptors to function objects and change the module marshalling code". Probably making annotations fit a single structure that can live in co_consts would make this change easier, and would also make startup of annotated modules faster (because you load a single constant instead of one per argument); this might be a valuable change by itself. On 12 April 2018 at 11:48, INADA Naoki <songofacandy@gmail.com> wrote:
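This is easy to see with dis: under PEP 563 the annotation strings are ordinary constants of the enclosing code object, assembled by LOAD_CONST instructions before MAKE_FUNCTION, rather than living in a dedicated field the way docstrings and lnotab do (the exact bytecode varies by CPython version):

```python
import dis

src = (
    "from __future__ import annotations\n"
    "def f(a: int, b: str) -> bool: ...\n"
)
module_code = compile(src, "<demo>", "exec")
# The annotation strings appear among the *module* code object's
# constants, not in any dedicated field of f's code object.
dis.dis(module_code)
ns = {}
exec(module_code, ns)
print(ns["f"].__annotations__)   # {'a': 'int', 'b': 'str', 'return': 'bool'}
```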
-- Daniel F. Moisset - UK Country Manager - Machinalis Limited www.machinalis.co.uk <http://www.machinalis.com> Skype: @dmoisset T: + 44 7398 827139 1 Fore St, London, EC2Y 9DT Machinalis Limited is a company registered in England and Wales. Registered number: 10574987.

I've been playing a bit with this, trying to collect some data and measure how useful this would be. You can take a look at the script I'm using at: https://github.com/dmoisset/pycstats

What I'm measuring is:

1. Number of objects in the pyc, and how many of those are:
   * docstrings (I'm using a heuristic here which I'm not 100% sure is correct)
   * lnotabs
   * duplicate objects; these have not been discussed in this thread before, but are another source of optimization I noticed while writing this. Essentially I'm referring to immutable constants that are instantiated more than once and could be shared. You can also measure the effect of this optimization across modules and within a single module[1]
2. Bytes used in memory by the categories above (sum of sys.getsizeof() for each category).

I'm not measuring anything related to annotations because, as I mentioned before, they are generated piecemeal by executable bytecode, so they are hard to separate.

Running this on my Python 3.6 pyc cache I get:

$ find /usr/lib/python3.6 -name '*.pyc' | xargs python3.6 pycstats.py
8645 docstrings, 1705441B
19060 lineno tables, 941702B
59382/202898 duplicate objects for 3101287/18582807 memory size

So this means around ~10% of the memory used after loading is used for docstrings, ~5% for lnotabs, and ~15% for objects that could be shared. The sharing assumes we can share between modules, but even doing it within modules you can get to ~7%. In short, this could mean a 25%-35% reduction in memory use for code objects, if the stdlib is a good benchmark.

Best, D.

[1] Regarding duplicates, I've found some unexpected things within loaded code objects, for example instances of the small integer "1" with different id() than the singleton that CPython normally uses for "1", although most duplicates are some small strings, tuples with argument names, or . Something that could be interesting to write is a "pyc optimizer" that removes duplicates; this should be a gain at a minimal preprocessing cost.
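A minimal version of the kind of measurement pycstats performs might look like this. Helper names are made up for the sketch, the docstring heuristic is the same rough one the message admits to, and the line-table attribute differs across CPython versions, so treat the numbers as approximate:

```python
import sys
import types

def iter_codes(co):
    # walk a code object and every code object nested in its constants
    yield co
    for const in co.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_codes(const)

def stats(source):
    doc_bytes = line_bytes = dup_bytes = 0
    seen = set()
    for co in iter_codes(compile(source, "<mod>", "exec")):
        consts = co.co_consts
        # heuristic: a leading string constant is (usually) a docstring
        if consts and isinstance(consts[0], str):
            doc_bytes += sys.getsizeof(consts[0])
        # co_lnotab was superseded by co_linetable in CPython 3.10
        table = getattr(co, "co_linetable", None)
        if table is None:
            table = co.co_lnotab
        line_bytes += sys.getsizeof(table)
        for const in consts:
            key = (type(const).__name__, repr(const))
            if key in seen:
                dup_bytes += sys.getsizeof(const)   # candidate for sharing
            seen.add(key)
    return doc_bytes, line_bytes, dup_bytes

SRC = 'def f():\n    "docstring for f"\n    return 1\n'
doc_bytes, line_bytes, dup_bytes = stats(SRC)
```

Running `stats` over every module in a tree (instead of one string) and summing the three counters gives figures comparable to the ones quoted in the message.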
On 12 April 2018 at 15:16, Daniel Moisset <dmoisset@machinalis.com> wrote:

I think moving data out of pyc files is going in the wrong direction: more stat calls means slower import and slower startup time. Trying to make pycs smaller also isn't really worth it (they compress quite well). Saving memory could be done by reading objects lazily from the file - without removing anything from the pyc file. Whether the few 100kB of RAM this saves is worth the effort depends on the application space.

This leaves the proposal to restructure pyc files into a sectioned, and possibly indexed, file to make access to (lazily) loaded parts faster. More structure would add ways to more easily update the content going forward (similar to how PE executable files are structured) and allow us to get rid of extra pyc file variants (e.g. for special optimized versions). So that's an interesting approach :-)

BTW: In all this, please remember that quite a few applications do use doc strings as part of the code, not only for documentation. Most prominent are probably parsers which keep the parsing definitions in doc strings. On 12.04.2018 20:32, Daniel Moisset wrote:
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Apr 12 2018)
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/

On Fri, 13 Apr 2018 at 03:47, M.-A. Lemburg <mal@egenix.com> wrote:
+1. With this in place -O and -OO cmdline options would become even less useful (which is good). -- Giampaolo - http://grodola.blogspot.com

On 2018-04-12, M.-A. Lemburg wrote:
I would like to see a format that can hold one or more modules in a single file. Something like the zip format, but optimized for fast interpreter startup time. It should support lazy loading of module parts (e.g. maybe my lazy bytecode execution idea[1]). Obviously there are a lot of details to work out. The design should also take into account the widespread use of virtual environments. So, it should be easy and space efficient to build virtual environments using this format (e.g. maybe allow overlays so that the stdlib package is not copied into the virtual environment; virtual packages would be overlaid on the stdlib file). Also, it should be easy to bundle all modules into an "uber" package and append it to the Python executable. CPython should provide out-of-the-box support for single-file executables. 1. https://github.com/python/cpython/pull/6194
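Python's existing zipimport machinery already demonstrates the "many modules in one file" half of this idea; a zip archive on sys.path is importable today. The module name and contents below are made up for the demo:

```python
import importlib
import os
import sys
import tempfile
import zipfile

# build a single-file bundle holding one module
tmpdir = tempfile.mkdtemp()
bundle = os.path.join(tmpdir, "bundle.zip")
with zipfile.ZipFile(bundle, "w") as zf:
    zf.writestr("greet.py", "def hello():\n    return 'hello'\n")

# putting the archive on sys.path makes zipimport handle the rest
sys.path.insert(0, bundle)
greet = importlib.import_module("greet")
```

What zipimport does not give you is the lazy, per-part loading and the overlay semantics described above; those would need the new format.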

On Sat, 14 Apr 2018 at 17:01 Neil Schemenauer <nas-python-ideas@arctrix.com> wrote:
Eric Snow, Barry Warsaw, and I chatted about a custom file format for holding Python source (and data files). My notes on the chat can be found at https://notebooks.azure.com/Brett/libraries/design-ideas/html/Python%20sourc... . (And since we aren't trying to rewrite bytecode we figured it wouldn't break your proposal, Neil ;) . -Brett

I'm not sure I understand the benefit of this; perhaps you can clarify. What I see is two scenarios:

Scenario A) External files are present. In this case, the data is loaded from the pyc and then from the external file, so there are no savings in memory, startup time, disk space, or network load time; it's just the same on-disk information and runtime structure with a different layout.

Scenario B) External files are not present. In this case, you get runtime improvements exactly identical to not having the data in the pyc, which is roughly what you get with -OO.

The only new capability I see this adds is the localization benefit; is that what this proposal is about? On 10 April 2018 at 17:14, Serhiy Storchaka <storchaka@gmail.com> wrote:

On Tue, 10 Apr 2018 19:14:58 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote:
An alternate proposal would be to have separate sections in a single marshal file. The main section (containing the loadable module) would have references to the other sections. This way it's easy for the loader to say "all references to the docstring section and/or to the annotation section are replaced with None", depending on how Python is started. It would also be possible to do it on disk with a strip-like utility. I'm not volunteering to do all this, so just my 2 cents ;-) Regards Antoine.
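The sectioned marshal file with a fixed index could be sketched as below. This is a hypothetical layout, not a proposed on-disk format; a strip-like utility would simply rewrite the file with a zero-length index entry for the section it removes, and the loader reads such sections back as None:

```python
import io
import marshal
import struct

SECTIONS = ("code", "docstrings", "annotations")

def write_pyc(stream, parts):
    # fixed index at the front: one (offset, length) pair per section,
    # length 0 meaning "section absent / stripped"
    stream.write(b"\0" * (8 * len(SECTIONS)))
    index = []
    for name in SECTIONS:
        data = parts.get(name)
        if data is None:
            index.append((0, 0))
        else:
            blob = marshal.dumps(data)
            index.append((stream.tell(), len(blob)))
            stream.write(blob)
    stream.seek(0)
    for off, length in index:
        stream.write(struct.pack("<II", off, length))

def read_section(stream, name):
    # a loader started with -OO would simply never call this for docstrings
    stream.seek(8 * SECTIONS.index(name))
    off, length = struct.unpack("<II", stream.read(8))
    if length == 0:
        return None               # stripped sections read back as None
    stream.seek(off)
    return marshal.loads(stream.read(length))

buf = io.BytesIO()
write_pyc(buf, {"code": compile("x = 1", "<mod>", "exec"),
                "docstrings": ["module docstring"]})
ns = {}
exec(read_section(buf, "code"), ns)
```

The sketch assumes the stream is positioned at offset 0 when writing begins; marshal handles code objects natively, which is what makes this kind of layout cheap to implement.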

On 11 April 2018 at 02:14, Serhiy Storchaka <storchaka@gmail.com> wrote:
While I don't think the default inline pyc format should change, in my ideal world I'd like to see the optimized format change to a side-loading model where these things are still emitted, but they're placed in a separate metadata file that isn't loaded by default. The metadata file would then be lazily loaded at runtime, such that `-O` gave you the memory benefits of `-OO`, but docstrings/annotations/source line references/etc could still be loaded on demand if something actually needed them. This approach would also mitigate the valid points Chris Angelico raises around hot reloading support - we could just declare that it requires even more care than usual to use hot reloading in combination with `-O`. Bonus points if the sideloaded metadata file could be designed in such a way that an extension module compiler like Cython or an alternate pyc compiler frontend like Hylang could use it to provide relevant references back to the original source code (JavaScript's source maps may provide inspiration on that front). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Tue, 10 Apr 2018 19:14:58 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote:
Indeed, it may be nice to find a solution to ship them separately.
What is the weight of lnotab arrays? While docstrings can be large, I'm somehow skeptical that removing lnotab arrays would bring a significant improvement. It would be nice to have more data about this.
3. Annotations. They are used mainly by third party tools that statically analyze sources. They are rarely used at runtime.
Even less used than docstrings probably. Regards Antoine.

On Tue, Apr 10, 2018 at 12:51 PM Eric V. Smith <eric@trueblade.com> wrote:
Yep. Everything accessible in any way at runtime is used by something at runtime. It's a public API; we can't just get rid of it. Several libraries rely on docstrings being available (an additional case in point beyond the already-linked CLI tool: ply <http://www.dabeaz.com/ply/ply.html>). Most of the world never appears to use -O and -OO. If they do, they don't use these libraries, or they jump through special hoops to prevent pyo compilation of any sources that need them. (unlikely) -gps

On Tue, Apr 10, 2018 at 9:50 PM, Eric V. Smith <eric@trueblade.com> wrote:
Astropy uses annotations at runtime for optional unit checking on arguments that take dimensionful quantities: http://docs.astropy.org/en/stable/api/astropy.units.quantity_input.html#astr...

10.04.18 19:24, Antoine Pitrou пише:
Maybe it is low. I just mentioned three kinds of data in pyc files that can be optional. If we move out docstrings and annotations, why not move lnotabs too? It would be easy once we have implemented the infrastructure for the other two.
And since there is already a way of providing annotations in a human-readable format separately from the source code, it seems natural to provide a way to compile them into separate files.

On Wed, Apr 11, 2018 at 2:14 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
A deployed Python distribution generally has .pyc files for all of the standard library. I don't think people want to lose the ability to call help(), and unless I'm misunderstanding, that requires docstrings. So this will mean twice as many files and twice as many file-open calls to import from the standard library. What will be the impact on startup time? ChrisA

On Tue, Apr 10, 2018 at 12:38 PM, Chris Angelico <rosuav@gmail.com> wrote:
What about, instead of separate files, turning the single file into a pseudo-zip file containing all of the proposed files, and providing a simple tool for removing whatever parts you don't want? -- Zach

There are libraries out there like this: https://docopt.readthedocs.io/en/0.2.0/ which use docstrings for runtime info. Today we already have -OO which allows us to create docstring-less bytecode files in case we have, after careful consideration, established that it is safe to do so. I think the current way (-OO) to avoid docstring loading is the correct one. It pushes the responsibility on whoever did the packaging to decide if -OO is appropriate. The ability to remove the docstrings after bytecode generation would be kinda nice (similar to Unix "strip" command) but given how fast bytecode compilation is, frankly I don't think it is very important. Stephan 2018-04-10 19:54 GMT+02:00 Zachary Ware <zachary.ware+pydev@gmail.com>:

On Tue, 10 Apr 2018 11:13:01 -0700 Ethan Furman <ethan@stoneleaf.us> wrote:
"python -O" and "python -OO" *do* generate different pyc files. If you want to trim docstrings with those options, you need to regenerate pyc files for all your dependencies (including third-party libraries and standard library modules). Serhiy's proposal allows "-O" and "-OO" to work without needing a custom bytecode generation step. Regard Antoine.

On 10/04/2018 18:54, Zachary Ware wrote:
Personally I quite like the idea of having the doc strings, and possibly other optional components, in a zipped section after a marker for the end of the operational code. Possibly the loader could stop reading at that point (reducing load time and memory impact), and only load and unzip on demand. Zipping the doc strings should give a significant reduction in file sizes, but it is worth remembering a couple of things:

- Python is already one of the most compact languages for what it can do - I have had experts demanding to know where the rest of the program is hidden, and how it is being downloaded, when they noticed the size of the installed code versus the functionality provided.
- File size <> disk space consumed - on most file systems each file typically occupies 1 + (file_size // allocation_size) clusters of the drive, and with increasing disk sizes the allocation_size is generally increasing; both of my NTFS drives currently have 4096 byte allocation sizes, but I am offered up to 2 MB allocation sizes. Splitting a 10,052 byte .pyc file (picking a random example from my drive) into a 5,052 and a 5,000 byte file will change the disk space occupied from 3*4,096 to 4*4,096 bytes, plus the extra directory entry.
- Where absolute file size is critical (such as on embedded systems), you can always use the -O & -OO flags.

-- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer.
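The cluster arithmetic in the second point can be checked in a couple of lines, assuming 4096-byte allocation units and that a file occupies whole units (ceiling division):

```python
def clusters(size, alloc=4096):
    # ceiling division: a file occupies whole allocation units
    return -(-size // alloc)

# the 10,052 byte example file from the message:
whole = clusters(10052)                  # one file: 3 clusters (12,288 bytes)
split = clusters(5052) + clusters(5000)  # split in two: 2 + 2 = 4 clusters
```

So the split costs one extra cluster (4,096 bytes) plus a directory entry, exactly as the message says.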

On Wed, Apr 11, 2018 at 03:38:08AM +1000, Chris Angelico wrote:
I shouldn't think that the number of files on disk is very important, now that they're hidden away in the __pycache__ directory where they can be ignored by humans. Even venerable old FAT32 has a limit of 65,534 files in a single folder, and 268,435,437 on the entire volume. So unless the std lib expands to 16000+ modules, the number of files in the __pycache__ directory ought to be well below that limit. I think even MicroPython ought to be okay with that. (But it would be nice to find out for sure: does it support file systems with *really* tiny limits?) The entire __pycache__ directory is supposed to be a black box except under unusual circumstances, so it doesn't matter (at least not to me) if we have:

__pycache__/spam.cpython-38.pyc

alone or:

__pycache__/spam.cpython-38.pyc
__pycache__/spam.cpython-38-doc.pyc
__pycache__/spam.cpython-38-lno.pyc
__pycache__/spam.cpython-38-ann.pyc

(say). And if the external references are loaded lazily, on need, rather than eagerly, this could save startup time, which I think is the intention. The doc strings would still be available, just not loaded until the first time you try to use them. However, Python supports byte-code only distribution, using .pyc files external to the __pycache__. In that case, it would be annoying and inconvenient to distribute four top-level files, so I think that the use of external references has to be optional, and there has to be a way to either compile to a single .pyc file containing all four parts, or an external tool that can take the existing four files and merge them. -- Steve
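For reference, the __pycache__ naming shown above can be derived with the real importlib.util.cache_from_source() API; the "-doc" sidecar name is hypothetical, but it could reuse the same scheme:

```python
import importlib.util as util

# where CPython caches bytecode for a given source file
pyc_path = util.cache_from_source("spam.py")

# the optimization level is already encoded in the cache file name
opt2_path = util.cache_from_source("spam.py", optimization=2)

# a hypothetical docstring sidecar following the same convention
doc_sidecar = pyc_path[:-len(".pyc")] + "-doc.pyc"
```

The cache tag (e.g. `cpython-38`) comes from sys.implementation.cache_tag, so the exact path depends on the interpreter running the code.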

On Wed, Apr 11, 2018 at 10:03 AM, Steven D'Aprano <steve@pearwood.info> wrote:
File system limits aren't usually an issue; as you say, even FAT32 can store a metric ton of files in a single directory. I'm more interested in how long it takes to open a file, and whether doubling that time will have a measurable impact on Python startup time. Part of that cost can be reduced by using openat(), on platforms that support it, but even with a directory handle, there's still a definite non-zero cost to opening and reading an additional file. ChrisA

On Wed, Apr 11, 2018 at 10:08:58AM +1000, Chris Angelico wrote:
Yes, it will double the number of files. Actually quadruple it, if the annotations and line numbers are in separate files too. But if most of those extra files never need to be opened, then there's no cost to them. And whatever extra cost there is, is amortized over the lifetime of the interpreter. The expectation here is that this could reduce startup time, since the files which are read are smaller, and less data needs to be read and sent over the network up front; the rest can be deferred until it is actually needed. Serhiy is experienced enough that I think we should assume he's not going to push this optimization into production unless it actually does reduce startup time. He has proven himself enough that we should assume competence rather than incompetence :-) Here is the proposal as I understand it:

- by default, change .pyc files to store annotations, docstrings and line numbers as references to external files which will be lazily loaded on-need;
- single-file .pyc files must still be supported, but this won't be the default and could rely on an external "merge" tool;
- objects that rely on docstrings or annotations, such as dataclasses, may experience a (hopefully very small) increase in import time, since they may not be able to defer loading the extra files;
- but in general, most modules should (we expect) see a decrease in load time;
- which will (we hope) reduce startup time;
- libraries which make eager use of docstrings and annotations might even ship with the single-file .pyc instead (the library installer can look after that aspect), and so avoid any extra cost.

Naturally pushing this into production will require benchmarks that prove this actually does improve startup time. I believe that Serhiy's reason for asking is to determine whether it is worth his while to experiment on this. There's no point in implementing these changes and benchmarking them if there's no chance of them being accepted.
So on the assumptions that:

- benchmarking does demonstrate a non-trivial speedup of interpreter startup;
- single-file .pyc files are still supported, for the use of byte-code only libraries;
- and modules which are particularly badly impacted by this change are able to opt out and use a single .pyc file;

I see no reason not to support this idea if Serhiy (or someone else) is willing to put in the work. -- Steve

On Wed, Apr 11, 2018 at 1:02 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Yes, if they are actually not needed. My question was about whether that is truly valid. Consider a very common use-case: an OS-provided Python interpreter whose files are all owned by 'root'. Those will be distributed with .pyc files for performance, but you don't want to deprive the users of help() and anything else that needs docstrings etc. So... are the docstrings lazily loaded or eagerly loaded? If eagerly, you've doubled the number of file-open calls to initialize the interpreter. (Or quadrupled, if you need annotations and line numbers and they're all separate.) If lazily, things are a lot more complicated than the original description suggested, and there'd need to be some semantic changes here.
Oh, I'm definitely assuming that he knows what he's doing :-) Doesn't mean I can't ask the question though. ChrisA

On Wed, Apr 11, 2018 at 02:21:17PM +1000, Chris Angelico wrote: [...]
We're never really going to know the effect on performance without implementing and benchmarking the code. It might turn out that, to our surprise, three quarters of the std lib relies on loading docstrings during startup. But I doubt it.
What relevance is that they're owned by root?
If eagerly, you've doubled the number of file-open calls to initialize the interpreter.
I do not understand why you think this is even an option. Has Serhiy said something that I missed that makes this seem to be on the table? That's not a rhetorical question -- I may have missed something. But I'm sure he understands that doubling or quadrupling the number of file operations during startup is not an optimization.
What semantic change do you expect? There's an implementation change, of course, but that's Serhiy's problem to deal with, and I'm sure that he has considered it. There should be no semantic change. When you access obj.__doc__, then and only then are the compiled docstrings for that module read from the disk. I don't know the current implementation of .pyc files, but I like Antoine's suggestion of laying them out in four separate areas (plus header), each one marshalled:

- code
- docstrings
- annotations
- line numbers

Aside from code, which is mandatory, the three other sections could be None to represent "not available", as is the case when you pass -OO to the interpreter; or they could be some other sentinel that means "load lazily from the appropriate file"; or they could be the marshalled data directly in place, to support byte-code only libraries. As for the in-memory data structures of objects themselves, I imagine something like the __doc__ and __annotations__ slots pointing to a table of strings, which is not initialised until you attempt to read from the table. Or something -- don't pay too much attention to my wild guesses. The bottom line is: is there some reason *aside from performance* to avoid this? Because if the performance is worse, I'm sure Serhiy will be the first to dump this idea. -- Steve
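The three-way slot the message describes (None for stripped, a sentinel for lazy, or inline data) could be sketched as follows. Names are hypothetical; this only illustrates the dispatch a loader would do:

```python
LAZY = object()   # sentinel meaning "fetch from the sidecar file on demand"

def resolve_docstring(slot, load_from_sidecar):
    """slot is whatever the marshalled section stored for this object."""
    if slot is None:
        return None                  # stripped, as with -OO
    if slot is LAZY:
        return load_from_sidecar()   # first access hits the disk
    return slot                      # marshalled in place (byte-code only dist)
```

The attraction of the scheme is that the fast path (inline or None) costs a couple of pointer comparisons, and only the sentinel case touches the file system.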

On Wed, Apr 11, 2018 at 4:06 PM, Steven D'Aprano <steve@pearwood.info> wrote:
You have to predict in advance what you'll want to have in your pyc files. Can't create them on the fly.
In other words, attempting to access obj.__doc__ can actually go and open a file. Does it need to check if the file exists as part of the import, or does it go back to sys.path? If the former, you're right back with the eager loading problem of needing to do 2-4 times as many stat calls; if the latter, it's semantically different in that a change to sys.path can influence something that normally is preloaded.
Obviously it could be turned into just a performance question, but in that case everything has to be preloaded, and I doubt there's going to be any advantage. To be absolutely certain of retaining the existing semantics, there'd need to be some sort of anchoring to ensure that *this* .pyc file goes with *that* .pyc_docstrings file. Looking them up anew will mean that there's every possibility that you get the wrong file back. As a simple example, upgrading your Python installation while you have a Python script running can give you this effect already. Just import a few modules, then change everything on disk. If you now import a module that was already imported, you get it from cache (and the unmodified version); import something that wasn't imported already, and it goes to the disk. At the granularity of modules, this is seldom a problem (I can imagine some package modules getting confused by this, but otherwise not usually), but if docstrings are looked up separately - and especially if lnotab is too - you could happily import and use something (say, in a web server), then run updates, and then an exception requires you to look up a line number. Oops, a few lines got inserted into that file, and now all the line numbers are straight-up wrong. That's a definite behavioural change. Maybe it's one that's considered acceptable, but it definitely is a change. And if mutations to sys.path can do this, it's definitely a semantic change in Python. ChrisA

On Thu, Apr 12, 2018 at 12:09:38AM +1000, Chris Angelico wrote: [...]
How is that different from the situation right now?
That's implementation, so I don't know, but I imagine that the module object will have a link pointing directly to the expected file on disk. No need to search the path, you just go directly to the expected file. Apart from handling the case when it doesn't exist, in which case the docstring or annotations get set to None, it should be relatively straight-forward. That link could be an explicit pathname: /path/to/__pycache__/foo.cpython-33-doc.pyc or it could be implicitly built when required from the "master" .pyc file's path, since the differences are likely to be deterministic.
Except that's not eager loading. When you open the file on demand, it might never be opened at all. If it is opened, it is likely to be a long time after interpreter startup.
You don't need to preload things to get a performance benefit. Preloading things that you don't need immediately and may never need at all, like docstrings, annotations and line numbers, is inefficient. I fear that you have completely failed to understand the (potential) performance benefit here. The point, or at least *a* point, of the exercise is to speed up interpreter startup by deferring some of the work until it is needed. When you defer work, the pluses are that it reduces startup time, and sometimes you can avoid doing it at all; the minus is that if you do end up needing to do it, you have to do a little bit extra. So let's look at a few common scenarios: 1. You run a script. Let's say that the script ends up loading, directly or indirectly, 200 modules, none of which need docstrings or annotations during runtime, and the script runs to completion without needing to display a traceback. You save loading 200 sets of docstrings, annotations and line numbers ("metadata" for brevity) so overall the interpreter starts up quicker and the script runs faster. 2. You run the same script, but this time it raises an exception and displays a traceback. So now you have to load, let's say, 20 sets of line numbers, which is a bit slower, but that doesn't happen until the exception is raised and the traceback printed, which is already a slow and exceptional case so who cares if it takes an extra few milliseconds? It is still an overall win because of the 180 sets of metadata you didn't need to load. 3. You have a long-running server application which runs for days or weeks between restarts. Let's say it loads 1000 modules, so you get significant savings during start up (let's say, hypothetically shaving off 2 seconds from a 30 second start up time), but over the course of the week it ends up eventually loading all 1000 sets of metadata. Since that is deferred until needed, it doesn't happen all at once, but spread out a little bit at a time. 
Overall, you end up doing four times as many file system operations, but since they're amortized over the entire week, not startup, it is still a win. (And remember that this extra cost only applies the first time a module's metadata is needed. It isn't a cost you keep paying over and over again.) We're (hopefully!) not going to care too much if the first few times the server needs to log a traceback, it hits the file system a few extra times. Logging tracebacks are already expensive, but they're also exceptional and so making them a bit more expensive is nevertheless likely to be an overall win if it makes startup faster. The cost/benefit accounting here is: we care far more about saving 2 seconds out of the 30 second startup (6% saving) than we care about spending an extra 8 seconds spread over a week (0.001% cost). 4. You're running the interactive interpreter. You probably aren't even going to notice the fact that it starts up a millisecond faster, or even 10 ms, but on the other hand you aren't going to notice either if the first time you call help(obj) it makes an extra four file system accesses and takes an extra few milliseconds. Likewise for tracebacks, you're not going to notice or care if it takes 350ms instead of 300ms to print a traceback. (Or however long it actually takes -- my care factor is too low to even try to measure it.) These are, in my opinion, typical scenarios. If you're in an atypical scenario, say all your modules are loaded over a network running over a piece of string stuck between two tin cans *wink*, then you probably will feel a lot more pain, but honestly that's not our problem. We're not obliged to optimize Python for running on broken networks. And besides, since we have to support byte-code only modules, and we want them to be a single .pyc file not four, people with atypical scenarios or make different cost/benefit tradeoffs can always opt-in to the single .pyc mode. [...]
As a simple example, upgrading your Python installation while you have a Python script running can give you this effect already.
Right -- so we're not adding any failure modes that don't already exist. It is *already* a bad idea to upgrade your Python installation, or even modify modules, while Python is running, since the source code may get out of sync with the cached line numbers and the tracebacks will become inaccurate. This is especially a problem when running in the interactive interpreter while editing the file you are running.
Indeed, but that's no different from what happens now when the same line number might point to a different line of source code.
Maybe it's one that's considered acceptable, but it definitely is a change.
I don't think it is a change, and I think it is acceptable. I think the solution is, don't upgrade your modules while you're still running them! -- Steve

On Thu, Apr 12, 2018 at 11:59 AM, Steven D'Aprano <steve@pearwood.info> wrote:
If the files aren't owned by root (more specifically, if they're owned by you, and you can write to the pycache directory), you can do everything at runtime. Otherwise, you have to do everything at installation time.
Referencing a path name requires that each directory in it be opened. Checking to see if the file exists requires, at absolute best, one more stat call, and that's assuming you have an open handle to the directory.
I have no idea what you mean here. Eager loading != opening the file on demand. Eager statting != opening on demand. If you're not going to hold open handles to heaps of directories, you have to reference everything by path name.
Right, and if you DON'T preload everything, you have a potential semantic difference. Which is exactly what you were asking me, and I was answering.
Does this loading happen when the exception is constructed or when it's printed? How much can you do with an exception without triggering the loading of metadata? Is it now possible for the mere formatting of a traceback to fail because of disk/network errors?
People DO run Python over networks, though, and people DO upgrade their Python installations.
Do you terminate every single Python process on your system before you upgrade Python? Let's say you're running a server on Red Hat Enterprise Linux or Debian Stable, and you go to apply all the latest security updates. Is that best done by shutting down every single application, THEN applying all updates, and only when that's all done, starting everything up? Or do you update everything on the disk, then pick one process at a time and signal it to restart? I don't know for sure about RHEL, but I do know that Debian's package management system involves a lot of Python. So it'd be a bit tricky to build your updater such that no Python is running during updates - you'd have to deploy a brand-new Python tree somewhere to use for installation, or something. And if you have any tiny little wrapper scripts written in Python, they could easily still be running across an update, even if the rest of the app is written in C. So, no. You should NOT have to take a blanket rule of "don't update while it's running". Instead, what you have is: "Binaries can safely be unlinked, and Python modules only get loaded when you import them".
Yes, this is true; but at least the mapping from byte code to line number is trustworthy. Worst case, you look at the traceback, and then interpret it based on an older copy of the .py file. If lnotab is loaded lazily, you don't even have that. Something's going to have to try to figure out what the mapping is.
If you need a solution to it, then it IS a change. Doesn't mean it can't be done, but it definitely is a change. (Look at the PEP 572 changes to list comprehensions at class scope. Nobody's denying that the semantics are changing; but normal usage won't ever witness the changes.) I don't think this is purely a performance question. ChrisA
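The bytecode-to-line-number mapping Chris refers to is exposed today through dis.findlinestarts(), which decodes co_lnotab (co_linetable on newer CPythons) into (offset, line) pairs:

```python
import dis

code = compile("x = 1\ny = 2\nz = x + y\n", "<mod>", "exec")
# decode the line-number table into (bytecode offset, source line) pairs
mapping = list(dis.findlinestarts(code))
```

If the table were side-loaded and went stale relative to the code section, every pair this produces could point at the wrong source line, which is exactly the hot-reload hazard under discussion.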

On 04/11/18 06:21, Chris Angelico wrote:
Currently in Fedora, we ship *both* optimized and non-optimized pycs to make sure both -O and non-O invocations will work nicely without root privileges. So splitting the docstrings into a separate file would be, for us, a benefit in terms of file size.

On 4/11/2018 4:26 AM, Petr Viktorin wrote:
Currently, the Windows installer has an option to pre-compile stdlib modules. (At least it does if one does an all-users installation.) If one selects this, it creates normal, -O, and -OO versions of each. Since, like most people, I never run with -O or -OO, replacing this redundancy with 1 segmented file or 2 non-redundant files might be a win for most people. -- Terry Jan Reedy

On Tue, Apr 10, 2018 at 5:03 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Our product uses the doc strings for auto-generated help, so we need to keep those. We also allow users to write plugins and scripts, so getting valid feedback in tracebacks is essential for our support people, so we'll keep the lno files, too. Annotations can probably go.

Looking at one of our little pyc files, I see:

    -rwx------+ 1 efahlgren admins 9252 Apr 10 17:25 ./lm/lib/config.pyc*

Since disk blocks are typically 4096 bytes, that's really a 12k file. Let's say it's 8k of byte code, 1k of doc, a bit of lno. The proposed layout would give:

    config.pyc     -> 8k
    config-doc.pyc -> 4k
    config-lno.pyc -> 4k

So now I've increased disk usage by about a third (yeah yeah, I know, I picked that small file on purpose to illustrate the point, but it's not unusual). These files are often opened over a network, at least for user plugins. This can take a really, really long time on some of our poorly connected machines, like 1-2 seconds per file (no kidding, it's horrible). Now instead of opening just one file in 1-2 seconds, we have tripled the time, just to do the stat+open, probably another stat to make sure there's no "ann" file lying about. Ouch. -1 from me.
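The disk-block arithmetic behind that estimate can be sketched directly. The block size and the split sizes below are assumptions taken from the message, not measured values; how much the split costs depends entirely on how the bytes are divided.

```python
BLOCK = 4096  # typical filesystem block size

def allocated(nbytes, block=BLOCK):
    """Disk space actually consumed: size rounded up to whole blocks."""
    return -(-nbytes // block) * block  # ceiling division

# One 9252-byte pyc occupies three 4k blocks...
single = allocated(9252)

# ...while splitting it into 8k of code, 1k of docstrings, and a tiny
# lnotab (sizes assumed for illustration) costs a full block per file.
split = allocated(8192) + allocated(1024) + allocated(40)

print(single, split)  # 12288 16384
```

With these assumed sizes the split layout takes 16k instead of 12k on disk; small trailing sections are the worst case, since each one rounds up to a whole block.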

2018-04-11 2:03 GMT+02:00 Steven D'Aprano <steve@pearwood.info>: [snip]
[snip] Hi all, just for information for everyone: I was a VMS system manager more than a decade ago, and I know that Windows NT (at least the core) was developed by a former VMS engineer. NTFS was created on the basis of the Files-11 (Files-11B) file system, and in both file systems a directory is a tree (in Files-11 it is a B-tree; in NTFS it may be a different kind of tree, but still a tree) holding the files ordered alphabetically. If there are "too many" files, accessing them becomes slower (check for example the windows\system32 folder). Some hundreds or 1-2 thousand files don't matter, but too many do.

I did a little measurement (intentionally not using functions, so as not to skew the result):

    import os
    import time

    try:
        os.mkdir('tmp_thousands_of_files')
    except OSError:
        pass

    name1 = 10001
    start = time.time()
    file_name = 'tmp_thousands_of_files/' + str(name1)
    f = open(file_name, 'w')
    f.write('aaa')
    f.close()
    stop = time.time()
    file_time = stop - start
    print(f'one file time {file_time} \n {start} \n {stop}')

    for i in range(10002, 20000):
        file_name = 'tmp_thousands_of_files/' + str(i)
        f = open(file_name, 'w')
        f.write('aaa')
        f.close()

    name2 = 10000
    start = time.time()
    file_name = 'tmp_thousands_of_files/' + str(name2)
    f = open(file_name, 'w')
    f.write('aaa')
    f.close()
    stop = time.time()
    file_time = stop - start
    print(f'after 10k, name before {file_time} \n {start} \n {stop}')

    name3 = 20010
    start = time.time()
    file_name = 'tmp_thousands_of_files/' + str(name3)
    f = open(file_name, 'w')
    f.write('aaa')
    f.close()
    stop = time.time()
    file_time = stop - start
    print(f'after 10k, name after {file_time} \n {start} \n {stop}')

Result (Python 3.6.1, Windows 8.1, SSD drive):

    c:\>python several_files_in_one_folder.py
    one file time 0.0
     1523476699.5144918
     1523476699.5144918
    after 10k, name before 0.015625953674316406
     1523476714.622918
     1523476714.6385438
    after 10k, name after 0.0
     1523476714.6385438
     1523476714.6385438

As you can see, an insertion into the beginning of the tree is much slower than adding to the end. (Yes, I know list insertion is slow as well, but I once saw a VMS directory with 50k files, and the dir command gave 5-10 files, then waited some seconds before the next 5-10 files ... ;-) ) BR, George

10.04.18 20:38, Chris Angelico пише:
Yes, this will mean more syscalls when importing with docstrings. But startup time doesn't matter for an interactive shell in which you call help(). The expectation is that programs which need to gain the benefit from separating optional components will run without loading them (as with option -OO). The overhead can be reduced by packing multiple files in a single archive. Finally, loading docstrings and other optional components can be made lazy. This was not in my original idea, and it would significantly complicate the implementation, but in principle it is possible. It would require larger changes in the marshal format and bytecode, and could open a door for further enhancements: loading the code and building classes and other complex data (especially heavy namedtuples, enums and dataclasses) on demand. Often you need just a single attribute or function from a large module. But that is a different change, out of the scope of this topic.
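The lazy-loading idea above can be illustrated with a toy sketch. Everything here is invented for illustration (the index layout, the function names, and an in-memory blob standing in for the side file); it only shows the shape of "store an offset, resolve the text on first access".

```python
# Hypothetical sketch of the "docstrings in a side file" idea: the
# code section stores (offset, length) references, and a loader
# resolves them lazily the first time a docstring is requested.

# Fake side file holding UTF-8 docstrings back to back.
_doc_blob = "Add two numbers.Multiply two numbers.".encode("utf-8")

# The (offset, length) pairs a pyc would store instead of the text.
_doc_index = {"add": (0, 16), "mul": (16, 21)}

_cache = {}

def load_doc(name):
    """Resolve a docstring reference on demand, caching the result."""
    if name not in _cache:
        off, length = _doc_index[name]
        _cache[name] = _doc_blob[off:off + length].decode("utf-8")
    return _cache[name]

print(load_doc("add"))  # Add two numbers.
```

If the side file is absent, load_doc would simply return None, matching the proposal's "references are replaced with None" behaviour.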

I'm +1 on this idea.

* The new pyc format has a code section (same as current) and a text section. The text section stores UTF-8 strings and is not loaded at import time.
* Function annotations (only when PEP 563 is used) and docstrings are stored as integers pointing to offsets in the text section.
* When type.__doc__, PyFunction.__doc__, or PyFunction.__annotations__ is an integer, the text is loaded from the text section lazily.

PEP 563 will reduce some startup time, but __annotations__ is still a dict, and its memory overhead is not negligible:

    In [1]: def foo(a: int, b: int) -> int:
       ...:     return a + b

    In [2]: import sys
    In [3]: sys.getsizeof(foo)
    Out[3]: 136
    In [4]: sys.getsizeof(foo.__annotations__)
    Out[4]: 240

When PEP 563 is used, there are no side effects while building the annotations, so they can be serialized as text, like {"a":"int","b":"int","return":"int"}. This change will require a new pyc format, and descriptors for PyFunction.__doc__, PyFunction.__annotations__ and type.__doc__. Regards, -- INADA Naoki <songofacandy@gmail.com>

One implementation difficulty specifically related to annotations is that they are quite hard to find/extract from the code objects. Both docstrings and lnotab live in specific fields of the code object for their function/class/module; annotations (assuming PEP 563) are spread as individual constants, loaded in bytecode through separate LOAD_CONST instructions before creating the function object, and that can happen in the middle of the bytecode for the higher-level object (the module or class containing a function definition). So the change needed to achieve this is more significant than just "add a couple of descriptors to function objects and change the module marshalling code". Making annotations fit a single structure that can live in co_consts could make this change easier, and also make startup of annotated modules faster (because you load a single constant instead of one per argument); that might be a valuable change by itself. On 12 April 2018 at 11:48, INADA Naoki <songofacandy@gmail.com> wrote:
-- Daniel F. Moisset - UK Country Manager - Machinalis Limited www.machinalis.co.uk <http://www.machinalis.com> Skype: @dmoisset T: + 44 7398 827139 1 Fore St, London, EC2Y 9DT Machinalis Limited is a company registered in England and Wales. Registered number: 10574987.
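The point about annotations being scattered through the bytecode can be seen directly with the stdlib dis module. The opcode sequence varies across CPython versions, so this only checks properties that hold generally rather than a fixed listing.

```python
import dis

# Compile an annotated function definition and look at the module
# bytecode that builds it. On the CPython versions discussed in this
# thread, the annotation values are pushed as separate constants
# before MAKE_FUNCTION, interleaved with the surrounding module
# bytecode -- which is what makes them hard to strip in one piece.
src = "def foo(a: int, b: int) -> int:\n    return a + b\n"
code = compile(src, "<example>", "exec")
ops = [ins.opname for ins in dis.get_instructions(code)]
print(ops)
```

The function object itself is also just a constant here, loaded with its own LOAD_CONST before MAKE_FUNCTION runs.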

I've been playing a bit with this, trying to collect some data and measure how useful this would be. You can take a look at the script I'm using at: https://github.com/dmoisset/pycstats

What I'm measuring is:

1. Number of objects in the pyc, and how many of those are:
   * docstrings (I'm using a heuristic here which I'm not 100% sure is correct)
   * lnotabs
   * duplicate objects; these have not been discussed in this thread before, but they are another source of optimization I noticed while writing this. Essentially I'm referring to immutable constants that are instantiated more than once and could be shared. You can also measure the effect of this optimization across modules and within a single module[1]
2. Bytes used in memory by the categories above (sum of sys.getsizeof() for each category).

I'm not measuring anything related to annotations because, as I mentioned before, they are generated piecemeal by executable bytecode, so they are hard to separate.

Running this on my Python 3.6 pyc cache I get:

    $ find /usr/lib/python3.6 -name '*.pyc' | xargs python3.6 pycstats.py
    8645 docstrings, 1705441B
    19060 lineno tables, 941702B
    59382/202898 duplicate objects for 3101287/18582807 memory size

So this means around ~10% of the memory used after loading is used for docstrings, ~5% for lnotabs, and ~15% for objects that could be shared. The sharing assumes we can share between modules, but even doing it only within modules you can get to ~7%. In short, this could mean a 25%-35% reduction in memory use for code objects, if the stdlib is a good benchmark.

Best, D.

[1] Regarding duplicates, I've found some unexpected things within loaded code objects, for example instances of the small integer "1" with a different id() than the singleton that CPython normally uses for "1", although most duplicates are small strings or tuples with argument names. Something that could be interesting to write is a "pyc optimizer" that removes duplicates; this should be a gain at a minimal preprocessing cost.
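A minimal sketch in the spirit of the pycstats script described above: walk a code object's constants recursively and count immutable constants that appear more than once (candidates for sharing). The names and heuristics here are our own, not taken from the actual script.

```python
import sys

def iter_consts(code):
    """Yield every non-code constant reachable from a code object."""
    for const in code.co_consts:
        if isinstance(const, type(code)):
            yield from iter_consts(const)  # recurse into nested code
        else:
            yield const

def duplicate_stats(code):
    """Count repeated constants and the bytes they occupy."""
    seen, dup_count, dup_bytes = set(), 0, 0
    for const in iter_consts(code):
        try:
            key = (type(const).__name__, const)
            hash(key)  # skip anything unhashable
        except TypeError:
            continue
        if key in seen:
            dup_count += 1
            dup_bytes += sys.getsizeof(const)
        else:
            seen.add(key)
    return dup_count, dup_bytes

# Two functions sharing the same string literal: at least one
# duplicate constant that a "pyc optimizer" could fold.
src = "def f():\n    return 'spam'\ndef g():\n    return 'spam'\n"
dups, size = duplicate_stats(compile(src, "<m>", "exec"))
print(dups, size)
```

Keying on (type name, value) keeps e.g. True and 1 distinct, which plain equality would conflate.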
On 12 April 2018 at 15:16, Daniel Moisset <dmoisset@machinalis.com> wrote:

I think moving data out of pyc files is going in the wrong direction: more stat calls means slower import and slower startup time. Trying to make pycs smaller also isn't really worth it (they compress quite well). Saving memory could be done by reading objects lazily from the file - without removing anything from the pyc file. Whether the few 100kB of RAM this saves is worth the effort depends on the application space. This leaves the proposal to restructure pyc files into a sectioned, and possibly indexed, file to make access to (lazily) loaded parts faster. More structure would add ways to more easily update the content going forward (similar to how PE executable files are structured) and allow us to get rid of the extra pyc file variants (e.g. for special optimized versions). So that's an interesting approach :-) BTW: In all this, please remember that quite a few applications use doc strings as part of the code, not only for documentation. Most prominent are probably parsers which keep the parsing definitions in doc strings. On 12.04.2018 20:32, Daniel Moisset wrote:
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Apr 12 2018)
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/

On Fri, 13 Apr 2018 at 03:47, M.-A. Lemburg <mal@egenix.com> wrote:
+1. With this in place -O and -OO cmdline options would become even less useful (which is good). -- Giampaolo - http://grodola.blogspot.com

On 2018-04-12, M.-A. Lemburg wrote:
I would like to see a format that can hold one or more modules in a single file. Something like the zip format but optimized for fast interpreter startup time. It should support lazy loading of module parts (e.g. maybe my lazy bytecode execution idea[1]). Obviously there are a lot of details to work out. The design should also take into account the widespread use of virtual environments. So, it should be easy and space efficient to build virtual environments using this format (e.g. maybe allow overlays so that the stdlib package is not copied into the virtual environment; virtual packages would be overlaid on the stdlib file). Also, it should be easy to bundle all modules into an "uber" package and append it to the Python executable. CPython should provide out-of-the-box support for single-file executables. 1. https://github.com/python/cpython/pull/6194
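Python already ships one single-file, multi-module container: zip archives are importable out of the box via sys.path and zipimport. Neil's proposal is for a format optimized well beyond this, but the existing mechanism shows the basic idea; the module name and contents below are invented for illustration.

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny zip archive containing one pure-Python module.
tmp = tempfile.mkdtemp()
archive = os.path.join(tmp, "bundle.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("greet.py", "def hello():\n    return 'hi'\n")

# Putting the archive itself on sys.path makes its contents
# importable; zipimport handles compilation and loading.
sys.path.insert(0, archive)
import greet

print(greet.hello())  # hi
```

What zip does not give you is lazy loading of module *parts* (docstrings, lnotab, annotations), which is the extra step the proposed format would add.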

On Sat, 14 Apr 2018 at 17:01 Neil Schemenauer <nas-python-ideas@arctrix.com> wrote:
Eric Snow, Barry Warsaw, and I chatted about a custom file format for holding Python source (and data files). My notes on the chat can be found at https://notebooks.azure.com/Brett/libraries/design-ideas/html/Python%20sourc... . (And since we aren't trying to rewrite bytecode we figured it wouldn't break your proposal, Neil ;) . -Brett

I'm not sure I understand the benefit of this; perhaps you can clarify. What I see are two scenarios:

Scenario A) External files are present. In this case, the data is loaded from the pyc and then from the external file, so there are no savings in memory, startup time, disk space, or network load time; it's just the same on-disk information and runtime structure with a different layout.

Scenario B) External files are not present. In this case, you get runtime improvements exactly identical to not having the data in the pyc, which is roughly what you get with -OO.

The only new capability I see this adds is the localization benefit; is that what this proposal is about? On 10 April 2018 at 17:14, Serhiy Storchaka <storchaka@gmail.com> wrote:

On Tue, 10 Apr 2018 19:14:58 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote:
An alternate proposal would be to have separate sections in a single marshal file. The main section (containing the loadable module) would have references to the other sections. This way it's easy for the loader to say "all references to the docstring section and/or to the annotation section are replaced with None", depending on how Python is started. It would also be possible to do it on disk with a strip-like utility. I'm not volunteering to do all this, so just my 2 cents ;-) Regards Antoine.
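A toy sketch of that sectioned layout: a small header listing (tag, offset, size) entries, so a loader, or a strip-like tool, can drop the docstring or annotation sections wholesale and have lookups fall back to None. The tags, struct layout, and function names here are invented for illustration only.

```python
import struct

def pack_sections(sections):
    """sections: dict mapping a 4-byte tag to its payload bytes."""
    header_size = 4 + 12 * len(sections)  # count + 12 bytes per entry
    index, body, offset = [], b"", header_size
    for tag, payload in sections.items():
        index.append(struct.pack("<4sII", tag, offset, len(payload)))
        body += payload
        offset += len(payload)
    return struct.pack("<I", len(sections)) + b"".join(index) + body

def read_section(blob, want):
    """Return a section's bytes, or None if it was stripped/absent."""
    (count,) = struct.unpack_from("<I", blob, 0)
    for i in range(count):
        tag, off, size = struct.unpack_from("<4sII", blob, 4 + 12 * i)
        if tag == want:
            return blob[off:off + size]
    return None  # stripped section -> loader substitutes None

blob = pack_sections({b"code": b"<marshalled code>",
                      b"docs": b"docstring text"})
print(read_section(blob, b"docs"))  # b'docstring text'
print(read_section(blob, b"anno"))  # None
```

A strip utility would simply rewrite the file with the unwanted tags omitted; loaders that never ask for those tags pay no cost at all.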

On 11 April 2018 at 02:14, Serhiy Storchaka <storchaka@gmail.com> wrote:
While I don't think the default inline pyc format should change, in my ideal world I'd like to see the optimized format change to a side-loading model where these things are still emitted, but they're placed in a separate metadata file that isn't loaded by default. The metadata file would then be lazily loaded at runtime, such that `-O` gave you the memory benefits of `-OO`, but docstrings/annotations/source line references/etc could still be loaded on demand if something actually needed them. This approach would also mitigate the valid points Chris Angelico raises around hot reloading support - we could just declare that it requires even more care than usual to use hot reloading in combination with `-O`. Bonus points if the sideloaded metadata file could be designed in such a way that an extension module compiler like Cython or an alternate pyc compiler frontend like Hylang could use it to provide relevant references back to the original source code (JavaScript's source maps may provide inspiration on that front). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
participants (22)
- Antoine Pitrou
- Brett Cannon
- Chris Angelico
- Daniel Moisset
- Eric Fahlgren
- Eric V. Smith
- Erik Bray
- Ethan Furman
- George Fischhof
- Giampaolo Rodola'
- Gregory P. Smith
- INADA Naoki
- M.-A. Lemburg
- Neil Schemenauer
- Nick Coghlan
- Petr Viktorin
- Serhiy Storchaka
- Stephan Houben
- Steve Barnes
- Steven D'Aprano
- Terry Reedy
- Zachary Ware