Permanent code objects (less memory, quicker load, less Unix Copy On Write)
On 18 Jun 2020, at 10:36, Jonathan Fine <jfine2358@gmail.com> wrote:
Hi All
Summary: Shared objects in Unix are a major influence. This proposal can be seen as a first step towards packaging pure Python modules as Unix shared objects.
First, there's a high level overview. Then some technical stuff in Appendices.
An object is transient if it can be garbage collected. An object is permanent if it will never be garbage collected. Every interpreted Python function has a code object (that contains instructions for the interpreter). Many of these code objects persist to the end of the program, and are used for little else than providing interpreter instructions.
We show that extending Python, to provide and take advantage of permanent code objects, will bring some benefits. The cost is expected to be quite small.
When a Python function is called, the interpreter increases the refcount of its code object. At the end of the function's execution, the interpreter decreases the refcount. (An example below shows this.)
If Python were extended to take advantage of permanent code objects, then for example popular code objects could be loaded into memory in this way. This can reduce memory usage (by sharing immutable resources) and reduce startup time.
In addition, a Unix forked process would have less need to do copy-on-write (see below). This is related to packaging pure Python modules as Unix shared objects.
The core of implementing this change would be to provide if ... else ... branching around the interpreter source code that changes the refcount of a code object. The interpreter itself will of course want direct access to the permanent code object. There is no harm in that.
The cost is that unprivileged access to fn.__code__ will be slower, due to an additional indirection. However, as such accesses are rare in ordinary programs, the cost is expected to be small.
It might be helpful, after checking the analysis and before coding, to do some simple timing tests and calculations to estimate the performance benefits and costs of making such a change. These would of course depend on the use case.
To make the code avoid COW you would need to be able to make sure that all code memory blocks are not mixed in with PyObject memory blocks. Then the ref count dance will not trigger COW for the code. Barry
I hope this helps.
Jonathan
APPENDICES ===========
SOME IMPLEMENTATION DETAILS AND COMMENTS Because fn.__code__ must not return a permanent object, some sort of opaque proxy would be required. Because Python programs rarely inspect fn.__code__, in practice the cost of this additional indirection is likely to be small.
As things are, the time spent changing the refcount of fn.__code__ is probably insignificant. The benefit is that permanent code objects are made immutable, and so can be stored safely in read-only memory (that can be shared across all processes and users). Code objects are special, in that they are only rarely looked at directly. Their main purpose is to be used by the interpreter.
Python allows the user to replace fn.__code__ by a different code object. This is a rarely done dirty trick. The transient / permanent nature of fn.__code__ could be stored as a hidden field on the fn object. This would reduce the cost of the if ... else ... branching, as it amounts to caching the transient / permanent nature of fn.__code__.
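For readers who have not seen the trick, here is a minimal sketch (it works here because both functions take no arguments and have no free variables):

    def f():
        return "old"

    def g():
        return "new"

    f.__code__ = g.__code__   # the 'dirty trick': swap in different code
    assert f() == "new"

A permanence scheme would presumably need to leave this behaviour intact.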
FORK AND COPY ON WRITE On Unix, the fork system call causes a process to make a child of itself. The parent and child share memory. To avoid confusion and errors, when either asks the system to write to shared memory, the system ensures that both parent and child have their own copy (of the page of memory that is being written to). This is an expensive operation. See: https://en.wikipedia.org/wiki/Copy-on-write
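To illustrate (a minimal, Unix-only sketch): merely calling a function in a forked child writes to the refcount field of its code object, dirtying the page that parent and child had been sharing.

    import os
    import sys

    def work():
        return 42

    # After fork, parent and child share the pages holding work.__code__.
    pid = os.fork()
    if pid == 0:
        # Child: calling work() bumps the code object's refcount -- a
        # write, so the kernel must give the child its own copy of the page.
        print("child refcount:", sys.getrefcount(work.__code__))
        work()
        os._exit(0)
    os.waitpid(pid, 0)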
INTERPRETER SESSION
>>> from sys import getrefcount as grc
# Identical functions with different code objects.
>>> def f1(obj): return grc(obj)
>>> def f2(obj): return grc(obj)
>>> f1.__code__ is f2.__code__
False

# Initial values.
>>> grc(f1.__code__), grc(f2.__code__)
(2, 2)

# Calling f1 increases the refcount of f1.__code__.
>>> f1(f1), f1(f2), f2(f1), f2(f2)
(6, 4, 4, 6)

# If fn is a generator function, then x = fn() will increase the
# refcount of fn.__code__.
>>> def f1(): yield True
>>> grc(f1.__code__)
2

# Let's create and store 10 generators.
>>> iterables = [f1() for i in range(10)]
>>> grc(f1.__code__)
22

# Let's get one item from each.
>>> [next(i) for i in iterables]
[True, True, True, True, True, True, True, True, True, True]
>>> grc(f1.__code__)
22

# Let's exhaust all the iterables. This reduces the refcount.
>>> [next(i, False) for i in iterables]
[False, False, False, False, False, False, False, False, False, False]
>>> grc(f1.__code__)
12

# Nearly done. Now let go of the iterables.
>>> del iterables
>>> grc(f1.__code__)
2
On Thu, Jun 18, 2020 at 9:34 AM Barry Scott <barry@barrys-emacs.org> wrote:
To make the code avoid COW you would need to be able to make sure that all code memory blocks are not mixed in with PyObject memory blocks.
Then the ref count dance will not trigger COW for the code.
Indeed. CPython already has its own memory allocator, yes? How hard would it be to allocate all "immortal" objects in one region of memory, and regular objects in another? Presumably that would only be a cost at allocation time, and probably not a large one. So the trick is to determine which objects are immortal.

And Jonathan has a good point: though Python is perfectly capable of creating and destroying code objects (functions, classes, etc.), in practice most survive for the length of the program, so in most cases little memory would be wasted by making them immortal. And maybe the interpreter could be smart about guessing which are most likely to be mortal. If this really does make a difference, we could also add ways for the programmer to mark certain code objects as mortal or immortal as need be.

Finally -- and I'm way out of my depth here -- does this mean there is potential to significantly improve the performance of multiprocessing? Which would be really, really great, as the GIL has proven an intractable barrier to certain kinds of multi-threading.

-CHB
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On 18 Jun 2020, at 18:37, Christopher Barker <pythonchb@gmail.com> wrote:
Indeed. CPython already has its own memory allocator, yes? How hard would it be to allocate all "immortal" objects in one region of memory, and regular objects in another? Presumably that would only be a cost at allocation time, and probably not a large one. So the trick is to determine which objects are immortal.
The key part of the idea is that the memory holding the ref count is not adjacent to the memory holding the object's state. Further, rarely modified state should be kept away from usually modified state: the PyObject header in the ref-count heap, code objects in the rarely-modified heap, list and dict contents in the usually-modified heap.

I'm assuming that it's no longer true that a PyObject header is prepended to the memory of an object's state. Something like this has been described, I think on the C-API mailing list, in some form.

Barry
On Thu, Jun 18, 2020 at 06:49:13PM +0100, Barry Scott wrote:
The key part of the idea is that the memory holding the ref count is not adjacent to the memory holding the object's state. Further, rarely modified state should be kept away from usually modified state.
Isn't that going to play havoc with modern CPU pipelines and pre-fetching? Every time a function is called, there will be cache misses galore and performance will plummet?

I know very little about how this works, except a vague rule of thumb that in the 21st century memory locality is king. If you want code to be fast, keep it close together, not spread out.

https://stackoverflow.com/questions/16699247/what-is-a-cache-friendly-code

When I'm programming at the Python level I don't need to worry about any of this (and in fact I can't do anything about it), but I expect that for the C implementation it will be pretty critical.

-- Steven
On 19/06/20 9:28 am, Steven D'Aprano wrote:
I know very little about how this works except a vague rule of thumb that in the 21st century memory locality is king. If you want code to be fast, keep it close together, not spread out.
Python objects are already scattered all over memory, and a function already consists of several objects -- the function object itself, a dict, a code object, lists of argument and local variable names, etc. I doubt whether the suggested change would make locality noticeably worse. -- Greg
Hi Greg

You wrote, replying to Steven's remark that "in the 21st century memory locality is king":

Python objects are already scattered all over memory, and a function already consists of several objects -- the function object itself, a dict, a code object, lists of argument and local variable names, etc. I doubt whether the suggested change would make locality noticeably worse.
I like this response. I'd also add a link to https://en.wikipedia.org/wiki/Non-uniform_memory_access

NUMA arises because quick memory is expensive, with CPU registers the quickest of all, revolving hard disks (and then tape) at the slow end, and RAM and SSD somewhere in the middle. In addition, modern CPUs are multicore, each core with its own registers and cache. Optimising such systems is difficult, particularly as what's best depends on the data being processed.

One part of testing would be to provide guidance as to the situations in which permanent code objects would bring some benefit. Even with UMA (no caches etc.), there could be the benefit of reduced use of memory, as we would no longer keep two almost identical copies of the same object. An aside: precisely with UMA there is no benefit to locality.

I'd say getting as much of the busy code as you can into the CPU caches will bring benefits. Unix introduces shared objects, and on Linux C-coded extensions are available as shared objects.

>>> import lxml.etree as etree
>>> etree
<module 'lxml.etree' from '[...] etree.cpython-36m-x86_64-linux-gnu.so'>

Summary: The proposal is that the Python interpreter be extended, so that it can access pure Python code objects that are part of a Unix shared object.

I hope this helps. Thank you, Greg and Steve, for your interest and contributions. -- Jonathan
On Fri, Jun 19, 2020 at 06:33:59PM +1200, Greg Ewing wrote:
On 19/06/20 9:28 am, Steven D'Aprano wrote:
I know very little about how this works except a vague rule of thumb that in the 21st century memory locality is king. If you want code to be fast, keep it close together, not spread out.
Python objects are already scattered all over memory, and a function already consists of several objects -- the function object itself, a dict, a code object, lists of argument and local variable names, etc. I doubt whether the suggested change would make locality noticeably worse.
There's a difference between "I doubt" and "it won't". Unless you're an expert on C-level optimizations, like Victor or Serhiy, which I definitely am not, I think we're both just guessing.

Here is some evidence that cache misses make a real difference for performance. A 70% slowdown on calling functions, due to an increase in L1 cache misses: https://bugs.python.org/issue28618

There's also been a lot of work done on using immortal objects. The results have been mixed at best: https://bugs.python.org/issue40255

Jonathan's initial post claimed to have shown that this technique will be of benefit: "We show that extending Python, to provide and take advantage of permanent code objects, will bring some benefits." But there is a huge gulf between faster in theory and faster in practice, and we should temper our enthusiasm and not claim certainty in the face of a vast gulf of uncertainty. I know that Python-Ideas is not held to the same standards as scientific papers, but if you claim to have shown a benefit, you really ought to have *actually* shown a benefit, not just identified a promising area for future study.

Making objects immortal is not free, it risks memory leaks, and the evidence (as far as I can tell) is that it helps only a small subset of Python users (those that fork() lots of worker processes) at the expense of the majority of users.

Personally, based on my *extremely limited* (i.e. infinitesimal) knowledge of C-level optimizations on 21st-century CPUs, I don't think this is a promising area to explore, except maybe as an option. (A runtime switch, perhaps, or a build option?) If Jonathan, or anyone else, thinks differently and is willing to run some before-and-after benchmarks, I look forward to being proven wrong :-)

-- Steven
On 20/06/20 1:15 pm, Steven D'Aprano wrote:
Here is some evidence that cache misses makes a real difference for performance. A 70% slow down on calling functions, due to an increase in L1 cache misses:
There's no doubt that cache misses are a big issue for machine instructions. But the same reasoning doesn't automatically extend to bytecode instructions, or other data in Python objects. The interpreter executes tens to hundreds of machine instructions for each bytecode instruction fetched.
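For example (a small sketch; the exact bytecode varies by Python version), a single source line already expands to several bytecode instructions, each dispatched through the interpreter loop:

    import dis

    # One line of source becomes a handful of bytecode instructions; the
    # interpreter spends many machine instructions executing each one.
    dis.dis(compile("total += data[i]", "<example>", "exec"))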
but there is a huge gulf between faster in theory and faster in practice
Yes, and also between slower in theory and slower in practice. There's no substitute for measurement when it comes to things like this. -- Greg
On 18 Jun 2020, at 22:28, Steven D'Aprano <steve@pearwood.info> wrote:
On Thu, Jun 18, 2020 at 06:49:13PM +0100, Barry Scott wrote:
The key part of the idea is that the memory holding the ref count is not adjacent to the memory holding the object's state. Further, rarely modified state should be kept away from usually modified state.
Isn't that going to play havoc with modern CPU pipelines and pre-fetching?
Not for instructions no.
Every time a function is called, there will be cache-misses galore and performance plummets?
Is there? What's to say that all the memory that is involved is not in cache lines in the cache? I cannot guess what the performance will be. I'd need to run performance tests to see what happens.
I know very little about how this works except a vague rule of thumb that in the 21st century memory locality is king. If you want code to be fast, keep it close together, not spread out.
Remember that the caches are large these days and can hold many cache lines. So what "local" really means is: in the cache.

If this idea works, then I think the winners will be apps that fork lots of child processes that will not modify memory set up by the parent. In other words, while this is an interesting thought experiment, I think it's not going to help single-process Python programs. The patch set that Guido highlighted is from just such an app.
https://stackoverflow.com/questions/16699247/what-is-a-cache-friendly-code
When I'm programming at the Python level I don't need to worry about any of this (and in fact I can't do anything about it), but I expect that for the C implementation, that will be pretty critical.
Indeed. Barry
On Fri, 19 Jun 2020 20:30:03 +0100 Barry Scott <barry@barrys-emacs.org> wrote:
I know very little about how this works except a vague rule of thumb that in the 21st century memory locality is king. If you want code to be fast, keep it close together, not spread out.
Remember that the caches are large these days and can hold many cache lines. So what "local" really means is: in the cache.
There's no such thing as "the cache". There are usually several levels of cache.

L1 cache is closest to the CPU and is very fast (latencies of access 3-4 cycles, and very high bandwidths). L2 cache is a bit further, still relatively fast, and often private per-core (typical latencies of access around 10-20 cycles). L3 cache is actually often relatively slow and shared between several or all cores (latencies vary quite a bit between CPU models, and can even be variable depending on exact data placement, but typical numbers are around 30-50 cycles). (Three levels of cache is the most common arrangement these days, but you may find CPUs without an L3, or even with an L3 and an L4.)

So if you hit in your L3 cache but miss in your L2 cache, you're already taking a large toll, during which your CPU may be stalled, waiting for data. Of course, accessing main memory is much slower yet (with latencies in the several hundreds of cycles, depending on memory speed and CPU frequency).

Now, the typical L2 cache size is around 512 kiB, and the footprint of the average Python application is certainly much larger than that. This is mitigated by the fact that modern CPUs have sophisticated prefetching techniques that try to *predict* the memory ranges that will be accessed by future instructions, so as to read them from memory into (L1 or L2) cache in advance. However, this only works if those addresses are predictable at all. For example, accesses into a hash table (such as a dict) are by construction rarely predictable.

Regards Antoine.
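A rough, machine-dependent sketch of that last point (sizes and timings are illustrative assumptions, not measurements): once the working set outgrows the caches, an unpredictable access order pays latencies that a sequential scan does not.

    import random
    import time

    N = 2_000_000                 # big enough to outgrow typical caches
    data = list(range(N))
    seq = list(range(N))          # predictable, prefetch-friendly order
    rnd = list(range(N))
    random.shuffle(rnd)           # unpredictable, hash-table-like order

    for name, order in (("sequential", seq), ("random", rnd)):
        t0 = time.perf_counter()
        total = 0
        for i in order:
            total += data[i]
        print(name, round(time.perf_counter() - t0, 3), "seconds")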
On Sun, 21 Jun 2020 11:07:05 +0200 Antoine Pitrou <solipsis@pitrou.net> wrote:
There's no such thing as "the cache". There are usually several levels of cache. L1 cache is closest to the CPU [...]
... Note by "closest to the CPU" I really mean "closest to the CPU core's execution units". Those caches are on-chip, otherwise communication would be much too slow. Regards Antoine.
Hi Barry

Thank you for your interest in my proposal. Let me try to answer your question. You wrote:

To make the code avoid COW you would need to be able to make sure that all code memory blocks are not mixed in with PyObject memory blocks. Then the ref count dance will not trigger COW for the code.
The relevant parts of my proposal are (emphasis added):

When a Python function is called, the interpreter increases the refcount of its code object. At the end of the function's execution, the interpreter decreases the refcount. (An example below shows this.)

**If Python were extended to take advantage of permanent code objects**, then for example popular code objects could be loaded into memory in this way. This can reduce memory usage (by sharing immutable resources) and reduce startup time.
I wanted this to mean that if the interpreter needs access to a permanent code object, it simply accesses it, without changing any reference counts. For this reason, permanent code objects don't need a refcount field. And without that field, there is no refcount dance.

The interpreter has privileged access to permanent code objects (just as it has to its own internals, unless deliberately exposed). However, the code that is being interpreted has no such access. To enable something like

>>> fn.__code__.co_code
b'whatever-the-bytecodes-are'

Python must have fn.__code__ return something that DOES have a refcount field. It could be some sort of proxy object (a sketch follows below). Or it could be a 'transient copy' of the permanent code object. I tried to explain this in the Appendix, perhaps not well enough.

I hope this helps. -- Jonathan
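As a pure-Python illustration only (the class is hypothetical, not a proposed API), the proxy idea might look like this: a small refcounted wrapper that forwards attribute reads to the permanent, uncounted code object.

    class CodeProxy:
        # Hypothetical sketch: a refcounted stand-in for a permanent
        # code object. In C, _permanent would be a raw pointer into
        # read-only, shareable memory; here an ordinary code object
        # stands in for it.

        __slots__ = ("_permanent",)

        def __init__(self, permanent_code):
            self._permanent = permanent_code

        def __getattr__(self, name):
            # co_code, co_consts, etc. are reached through one extra
            # indirection -- the cost the proposal mentions.
            return getattr(self._permanent, name)

    def fn():
        return 42

    proxy = CodeProxy(fn.__code__)
    assert proxy.co_name == "fn"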
On 18 Jun 2020, at 18:42, Jonathan Fine <jfine2358@gmail.com> wrote:
Did my last reply cover a possible implementation of this? I.e., the code is nowhere near the ref count that triggers COW. Barry
Hi Barry

You wrote:

Did my last reply cover a possible implementation of this? I.e., the code is nowhere near the ref count that triggers COW.

Could you say: do you think it's possible to extend Python so that it can use permanent code objects, when they are made available? For the moment, that is the main question. Is the basic scheme possible? (For me, I use Unix shared objects as an existence proof.) If so, then I suspect that we are focused on different parts of the implementation details.

Oh, and thank you, Christopher, for your encouraging reply. -- Jonathan
On 18 Jun 2020, at 19:00, Jonathan Fine <jfine2358@gmail.com> wrote:
Hi Barry
You wrote:

Did my last reply cover a possible implementation of this? I.e., the code is nowhere near the ref count that triggers COW.
Could you say: do you think it's possible to extend Python so that it can use permanent code objects, when they are made available?
We need to define terms here. What do you mean by permanent? If you mean that, after forking, the child does not trigger COW on code that it does not modify, then yes, I think that can be implemented. Also, the child can modify the code if that is part of its algorithms, without problem.
For the moment, that is the main question. Is the basic scheme possible? (For me, I use Unix shared objects as an existence proof.)
Shared objects are not the same as forking the address space and avoiding COW. In the case of .so (and .dll) files, the memory used for the code and read-only data in an .so is only loaded into memory once, and the operating system makes the page tables point to the one copy. Further paging of that data does not use the Linux swap file (aka page file); clean pages can simply be re-read from the .so. A .pyc file cannot be used in the same way as an .so can, by simply mapping the file into memory and paging in the code as you touch it. Being able to create Python shared object files is another problem.
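A small sketch of the contrast (the source file name is hypothetical; the 16-byte .pyc header applies to CPython 3.7 and later): loading a .pyc means unmarshalling, which builds fresh heap objects in every process, whereas an .so's pages are simply mapped and shared.

    import marshal
    import py_compile

    # Compile a source file (assumed to exist) to a .pyc, then load it
    # the way the import system does: by unmarshalling.
    pyc_path = py_compile.compile("example.py")   # hypothetical file
    with open(pyc_path, "rb") as f:
        f.read(16)               # skip the header (magic, flags, date, size)
        code = marshal.load(f)   # a brand-new code object in this process
    exec(code)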
If so, then I suspect that we are focused on different parts of the implementation details.
I think that's right. I think I am showing how what you want, if I have not misunderstood, could be implemented.
Oh, and thank you Christopher for your encouraging reply.
Indeed! Barry
Hi Barry

You wrote:

We need to define terms here. What do you mean by permanent?

Good question. I think I answered it in my original post:

An object is transient if it can be garbage collected. An object is permanent if it will never be garbage collected.
You also wrote:
Being able to create python share object files is another problem.
At the level I'm working at, that's what might unkindly be called an implementation detail. It is in fact a very important matter. But a simple-minded implementation would be enough to produce a proof-of-concept. You wrote:

I think I am showing how what you want, if I have not misunderstood, could be implemented.
Oh, that's very good. In my original post my focus was on explaining as clearly as I could the basic concepts. For that reason (and others) I said little about implementation. If you like, I'd be very happy to discuss implementation with you and others, either here or on the Python C-API list. However, I'd like to put that off for a few days. And this would give others on this list an opportunity to comment. -- Jonathan
On 18 Jun 2020, at 19:30, Jonathan Fine <jfine2358@gmail.com> wrote:
Hi Barry
You wrote:
We need to define terms here. What do you mean by permanent?
Good question. I think I answered it in my original post:
An object is transient if it can be garbage collected. An object is permanent if it will never be garbage collected.
Aha. OK, that's what I think of as rarely modified, and only shared by forking.
You also wrote:
Being able to create python share object files is another problem.
At the level I'm working at, that's what might unkindly be called an implementation detail. It is in fact a very important matter. But a simple-minded implementation would be enough to produce a proof-of-concept.
It's a huge design problem. There are no simple PoCs that come to my mind.
I think I am showing how what you want, if I have not misunderstood, could be implemented.
Oh, that's very good. In my original post my focus was on explaining as clearly as I could the basic concepts. For that reason (and others) I said little about implementation. If you like, I'd be very happy to discuss implementation with you and others, either here or on the Python C-API list.
The reason that this is a problem in the first place is an implementation detail... Barry
Hello,

I think you forgot the all-important parts:

1) How does it work technically?
2) What performance gain on which benchmark?

Regards Antoine.
Hi Antoine

Thank you for your interest. You wrote:

I think you forgot the all-important parts: 1) How does it work technically? 2) What performance gain on which benchmark?

In my original post I wrote:

It might be helpful, after checking the analysis and before coding, to do some simple timing tests and calculations to estimate the performance benefits and costs of making such a change. These would of course depend on the use case.

I hope that adequately answers your second question. The technical appendix I provided made a start on answering your first question. My first goal was to produce a satisfactory description of the problem, and in general terms of how it might be solved. I'll be available to discuss how it might work technically next week.

I hope this helps. -- Jonathan
On Thu, Jun 18, 2020 at 5:36 AM Jonathan Fine <jfine2358@gmail.com> wrote:
Python allows the user to replace fn.__code__ by a different code object. This is a rarely done dirty trick.
A dirty trick to you maybe, but occasionally useful. For example, it can be used to implement goto: https://github.com/snoack/python-goto (I have used this once or twice.)

With that said, your proposal is unclear to me on whether this would force immutability on all code objects (and thereby prevent all bytecode modification), or whether it would have an opt-out (or opt-in) mechanism. I'm a solid -1 if this forces immutability on all code objects with no way to opt out for those cases where you do wish to modify the bytecode. Otherwise, I have no opinion (as I lack knowledge of the concepts your proposal is based on).
On Thu, Jun 18, 2020 at 09:30:30PM -0400, Jonathan Goble wrote:
With that said, your proposal is unclear to me on whether this would force immutability on all code objects (and thereby prevent all bytecode modification), or whether it would have an opt-out (or opt-in) mechanism.
Code objects are already immutable.

py> def f(): pass
...
py> o = f.__code__
py> o.co_consts = (1,)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: readonly attribute

-- Steven
One thought that I had is that this whole proposal seems to be based on code blocks never needing to be collected? Given the program:

    def fun1(v):
        return v

    def fun2(v):
        return v + 1

    fun1 = fun2

the function code block that was originally bound to the name fun1 should now be collected, as it is no longer referenced. And if this code is all in module mod1, there could be, way off somewhere else in code that imports this module, a statement like

    mod1.fun1 = fun2

the classic "monkey patch". So the compiler can't know if the code object really will be eternal; at best it knows that the code object can only be generated once by how it was compiled, so if we allow it to be leaked we have a finite loss.
Hi Richard

Thank you for your interest. You wrote:

One thought that I had is that this whole proposal seems to be based on code blocks never needing to be collected?

That's not quite what I meant to say. One part of the basic idea is that permanent code objects be made available to the Python interpreter. The second part is that the interpreter does not keep reference counts to these permanent objects.

What I did not say explicitly, or not clearly enough, was that the previous use would continue unchanged. The only change would be that a function object would have a flag, which would tell the interpreter whether the associated code object was transient or permanent. (This, as I recall, I did mention in my original post.)

Thank you for providing an example. This has made your concern very clear. It is my intention that this example, and also 'change the code' monkey patching, would continue to work as before.

To summarise: permanent code objects are to be optional. Use them only if they help in some way.

I hope this helps. Once again, thank you for your interest and contribution. With best wishes, Jonathan
On Fri, Jun 19, 2020 at 09:36:24AM +0100, Jonathan Fine wrote:
What I did not say explicitly, or not clearly enough, was that the previous use would continue unchanged. The only change would be that a function object would have a flag, which would tell the interpreter whether the associated code object was transient or permanent. (This, as I recall, I did mention in my original post.)
Who sets the flag? If user code can set the flag, then users can maliciously or accidentally set the flag on code objects which should be garbage collected, causing a memory leak, or clear the flag on code objects which should not be garbage collected, potentially causing a segfault.

I think that if anyone is imagining a process where the interpreter does something like this:

    # this happens on every single reference to any object
    if type(obj) is a code object and flag is set:
        pass
    else:
        increment the reference count

(and similar for decrements), then the cost of checking the type and/or flag is going to be significant. According to issue 40255, a similar experiment led to a 10% slowdown. https://bugs.python.org/issue40255

There are more practical ways to implement immortal objects. I don't know what they are :-) but they must exist, because Python had them before (and maybe still does?) and people keep experimenting with them.

-- Steven
I remember vaguely that about two decades ago Greg Stein hatched an idea for code objects loaded from a read-only segment in shared libraries. I believe we went as far as ensuring that the interpreter could read bytecode from things other than strings, and I vaguely recall seeing a design for a layout of the code object as well. But the idea was never consummated, and I think there were unsolved problems regarding reference counts and constants incorporated in the code object. The objectives were the same as the subject line of this thread, and I believe so were the objections.

--Guido van Rossum (python.org/~guido)
Hi All

Guido wrote:

I remember vaguely that about two decades ago Greg Stein hatched an idea for code objects loaded from a read-only segment in shared libraries.

[Thank you for this, Guido. Your memory is good.]

Here's a thread from 2009, where Guido said: "Greg Stein reached this same conclusion (and similar numbers) over 10 years ago ..."

Subject: Remove GIL with CAS instructions?
https://mail.python.org/archives/list/python-ideas@python.org/thread/6ZONFLM...

I looked up https://en.wikipedia.org/wiki/Compare-and-swap to read about CAS. Guido said this in the context of Antoine's statement: "Which makes me agree with the commonly expressed opinion that CPython would probably need to ditch refcounting (at least in the critical paths) if we want to remove the GIL."

In 2007 Guido posted to Artima: It isn't Easy to Remove the GIL: https://www.artima.com/weblogs/viewpost.jsp?thread=214235

In this post Guido writes: "In 1999 Greg Stein (with Mark Hammond?) produced a fork of Python (1.5 I believe) that removed the GIL, replacing it with fine-grained locks on all mutable data structures. [...] However, after benchmarking, it was shown that even on the platform with the fastest locking primitive (Windows at the time) it slowed down single-threaded execution nearly two-fold."

Guido also referenced this write-up from Greg: https://mail.python.org/pipermail/python-dev/2001-August/017099.html

I hope this helps. -- Jonathan
Hm, I remember Greg's free threading too, but that's not the idea I was trying to recall this time. There really was something about bytecode objects being loaded from a read-only segment to speed up code loading. (Much quicker than unmarshalling a .pyc file.) I don't think we ever got the details worked out to the point where we could benchmark.
On a related note, there was a patch that I’d written for Python 3.6 to store code objects in the read only segment of the interpreter binary for faster interpreter startup. I’d sent the patch to Larry Hastings, who graciously ported it to Python 3.8 and posted it on bpo[1]. - Jeethu [1]: https://bugs.python.org/issue34690
On 21.06.2020 01:47, Guido van Rossum wrote:
Hm, I remember Greg's free threading too, but that's not the idea I was trying to recall this time. There really was something about bytecode objects being loaded from a read-only segment to speed up code loading. (Much quicker than unmarshalling a .pyc file.) I don't think we ever got the details worked out to the point where we could benchmark.
Perhaps you are thinking about reading byte code from C arrays, as is done when freezing Python modules. I'm using this logic in PyRun to freeze (more or less) the entire Python stdlib.

It does have some effect on startup time, and because the OS typically shares such static code segments between processes, it helps a bit when you run multiple processes. But since the C arrays only store byte code and not code objects, you still need to create code objects and store a copy of the byte code in memory for every single process. This could probably be optimized by having the code object point to the C array to hold the byte code data, but I haven't looked into that.

I mainly use PyRun for packaging Python-based products for distribution on Linux, and as a replacement for virtualenv which doesn't rely on system Python installations, not so much to speed up anything.
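As a sketch of that per-process cost (pure Python; the bytes constant stands in for the static C array that freeze generates):

    import marshal

    # The marshalled form could live in shared, read-only memory...
    FROZEN = marshal.dumps(compile("x = 1", "<frozen example>", "exec"))

    def load_frozen():
        # ...but each call (think: each process) still builds a fresh,
        # heap-allocated code object from it.
        return marshal.loads(FROZEN)

    code1 = load_frozen()
    code2 = load_frozen()
    assert code1 is not code2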
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jun 21 2020)
Python Projects, Coaching and Support ... https://www.egenix.com/ Python Product Development ... https://consulting.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 https://www.egenix.com/company/contact/ https://www.malemburg.com/
On Sun, Jun 21, 2020 at 02:53 M.-A. Lemburg <mal@egenix.com> wrote:
Perhaps you are thinking about reading byte code from C arrays, as is done when freezing Python modules.
[...]
This could probably be optimized by having the code object point to the C array to hold the byte code data, but I haven't looked into that.
I believe this was what Greg Stein's idea here was about. (As well as Jonathan Fine's in this thread?) But the current use of code objects makes this hard. Perhaps the code objects could have a memoryview object to hold the bytecode instead of a bytes object. --Guido -- --Guido (mobile)
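A rough sketch of what the memoryview direction could look like, assuming a hypothetical file bytecode.bin that holds raw byte code, with made-up offsets:

import mmap

# Map the file read-only; the OS shares these pages between every
# process that maps the same file, and a fork never copies them,
# since nobody writes to them.
with open("bytecode.bin", "rb") as f:
    rodata = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

view = memoryview(rodata)    # zero-copy view over the mapping
co_code = view[16:80]        # slicing is zero-copy too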
On Mon, Jun 22, 2020 at 12:00 AM Guido van Rossum <guido@python.org> wrote:
I believe this was what Greg Stein's idea here was about. (As well as Jonathan Fine's in this thread?) But the current use of code objects makes this hard. Perhaps the code objects could have a memoryview object to hold the bytecode instead of a bytes object.
memoryview is a heavy object. Using memoryview instead of a bytes object will increase memory usage. I think a lightweight bytes-like object is better. My rough idea is:

* New code and pyc format
  * pyc has a "rodata" segment
  * It can be copied into a single memory block, or can be mmapped.
  * co_code should be aligned to at least 2 bytes.
  * code.co_code can point to a memory block in "rodata".
  * docstring, signature, and lnotab can be lazily loaded from "rodata".
  * signature is serialized in a JSON-like format.
* Allow multiple modules in a single file
  * Reduces the number of open file descriptors when using mmap
  * Merge more constants
* New Python object: PyROData
  * It is like a read-only bytearray.
  * But the body may be mmap-ped, instead of malloc-ed.
  * code objects own a reference to PyROData.
  * When PyROData is deallocated, it munmaps or frees the "rodata" segment.

Regards,
-- Inada Naoki <songofacandy@gmail.com>
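To illustrate the intent (not the real format), here's a minimal sketch of loading co_code from an mmapped "rodata" segment; the header layout and names are invented for the example:

import mmap
import struct

# Invented header: offset and length of one function's byte code
# within the "rodata" segment.
HEADER = struct.Struct("<II")

def load_co_code(path):
    with open(path, "rb") as f:
        rodata = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    offset, length = HEADER.unpack_from(rodata, 0)
    # A code object would hold this zero-copy slice in place of a
    # private bytes object; deallocating would munmap the segment.
    return memoryview(rodata)[offset:offset + length]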
On Mon, Jun 22, 2020 at 5:19 PM Inada Naoki <songofacandy@gmail.com> wrote:
I think a lightweight bytes-like object is better. My rough idea is:
[...]
With exec, it's possible to create a function that doesn't have any corresponding pyc or module. Would functions and code objects need to cope with both this style and the current model? ChrisA
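For example, nothing here has a module or .pyc behind it:

namespace = {}
exec("def f():\n    return 42\n", namespace)
f = namespace["f"]
print(f.__code__.co_filename)    # '<string>' -- no file to mmap
print(type(f.__code__.co_code))  # plain heap-allocated bytes

So code objects born from exec would presumably keep the current malloc-ed representation, alongside the mmapped one.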
On 22 Jun 2020, at 08:15, Inada Naoki <songofacandy@gmail.com> wrote:
[...]
* co_code should be aligned to at least 2 bytes.
Would higher alignment help? malloc uses 8- or 16-byte alignment, doesn't it? Would that be better for packing the byte code into cache lines?
Barry
On Mon, Jun 22, 2020 at 8:27 PM Barry Scott <barry@barrys-emacs.org> wrote:
Would higher alignment help? malloc uses 8- or 16-byte alignment, doesn't it? Would that be better for packing the byte code into cache lines?
It may, but I am not sure. I said "at least 2 bytes" because we use "word code": we read the word code through a `uint16_t *`, not an `unsigned char *`.
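The 2-byte units are easy to see from Python (this is how CPython 3.6+ word code is laid out; earlier versions used variable-length instructions):

import dis

def add_one(x):
    return x + 1

code = add_one.__code__.co_code
assert len(code) % 2 == 0  # word code: every instruction is 2 bytes
for opcode, oparg in zip(code[::2], code[1::2]):
    print(dis.opname[opcode], oparg)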
If you want to proceed in this direction, it would be better to do some more research into current CPU architectures and then build a VM-optimized byte code storage object which is well aligned, fits into today's caches, and improves locality. freeze.py could then write out this format as well, so that the object can directly point to the structure and have the OS deal with memory mapping and sharing byte code across processes.

I've had some good performance increases when I used the above approach in the mxTextTools tagging engine, which is a low-level VM for tagging and searching in text.

If compilers are made aware of the structure the VM will use, they may also be able to apply additional optimizations for faster byte code access, prefetching, etc.
I like where this is going. It would be nice if certain constants could also be loaded from RO memory.
-- --Guido (mobile)
Hi

SUMMARY: We're starting to discuss implementation. I'm going to focus on what can be done with only a few changes to the interpreter.

First consider this:

>>> from sys import getrefcount as grc
>>> def fn(obj): return grc(obj)
>>> grc(fn.__code__), grc(fn.__code__.co_code)
(2, 2)
>>> fn(fn.__code__), fn(fn.__code__.co_code)
(5, 4)
>>> grc(fn.__code__), grc(fn.__code__.co_code)
(2, 2)
# This is the bytecode.
>>> fn.__code__.co_code
b't\x00|\x00\x83\x01S\x00'

What's happening here? While the interpreter executes the pure Python function fn, it changes the refcount of both fn.__code__ and fn.__code__.co_code. This is one of the problems we have to solve to make progress. These refcounts are stored in the objects themselves, so unless the interpreter is changed, these Python objects can't be stored in read-only memory. We may have to change code and co_code objects also.

Let's focus on the bytecode, as it's the busiest and often largest part of the code object. The (ordinary) code object has a field which is (a pointer to) the co_code attribute, which is a Python bytes object. This is the bytecode, as a Python object.

Let's instead give the C implementation of fn.__code__ TWO fields. The first is a pointer, as usual, to the co_code attribute of the code object. The second is a pointer to the raw data of the co_code object. When the interpreter executes the code object, the second field tells the interpreter where to start executing. (This might be why the refcount of fn.__code__.co_code is incremented during the execution of fn.) The interpreter doesn't even have to look at the first field.

If we want the raw bytecode of a code object to lie in read-only memory, it is enough to set the second pointer to that location. In both cases, the interpreter reads the memory location of the raw bytecode and executes accordingly.

This leaves the problem of the first field. At present, it can only be a bytes object. When the raw bytecode is in read-only memory, we need a second sort of object. Its purpose is to 'do the right thing'. Let's call this sort of object perma_bytes. It's like a bytes object, except the data is stored elsewhere, in read-only permanent storage.

Aside: If the co_code attribute of a code object is ordinary bytes -- not perma_bytes -- then the two pointer addresses differ by a constant, namely the size of the header of a Python bytes object.

Any Python language operation on perma_bytes is done by performing the same Python operation on bytes, but on the raw data that is pointed to. (That raw data had better still be there, otherwise chaos or worse will result.)

So what have we gained, and what have we lost?

LOST:
1. The fn.__code__ object is bigger by the size of a pointer.
2. A perma_bytes object type is added.
3. Bytes operations on fn.__code__.co_code objects are slower.

GAINED:
1. We can store co_code data in read-only permanent storage.
2. perma_bytes might be useful elsewhere.

It may be possible to improve the outcome by making more changes to the interpreter. I don't see a way of getting a useful outcome by making fewer.

Here's another way of looking at things. If all the refcounts were stored in a single array, and the data stored elsewhere, the changing refcount wouldn't be a problem. Using perma_bytes allows the refcount and the data to be stored at different locations, thereby avoiding the refcount problem!

I hope this is clear enough, and that it helps. And that it is correct. I'll let T. S. Eliot have the last word: https://faculty.washington.edu/smcohen/453/NamingCats.html

The Naming of Cats is a difficult matter,
It isn't just one of your holiday games;
You may think at first I'm as mad as a hatter
When I tell you, a cat must have THREE DIFFERENT NAMES.

We're giving the raw data TWO DIFFERENT POINTERS.

with best wishes

Jonathan
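Here is a rough Python sketch of the perma_bytes idea above; the class name and methods are illustrative only, since the real object would be implemented in C, with the interpreter reading the raw-data pointer directly:

class PermaBytes:
    """Bytes-like view over externally owned, read-only data."""

    def __init__(self, view):
        # view: e.g. a slice of an mmapped read-only "rodata" segment.
        self._view = memoryview(view)

    def __len__(self):
        return len(self._view)

    def __getitem__(self, index):
        return self._view[index]

    def __bytes__(self):
        # Copies only when a real bytes object is explicitly requested.
        return bytes(self._view)

    def __eq__(self, other):
        return bytes(self) == other

Every Python-level operation is forwarded to the raw data the wrapper points at; only the small per-process wrapper is ever refcounted, so the data itself can live in shared read-only memory.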
participants (13)

- Antoine Pitrou
- Barry Scott
- Chris Angelico
- Christopher Barker
- Greg Ewing
- Guido van Rossum
- Inada Naoki
- Jeethu Rao
- Jonathan Fine
- Jonathan Goble
- M.-A. Lemburg
- Richard Damon
- Steven D'Aprano