Save memory when forking with *really* immutable objects

I'll admit in advance that this is in all likelihood a terrible idea. What I'm curious about is why it wouldn't work, or if it wouldn't even help ;-)

One problem for CPython is that it can't share data across processes very often. If you have an application server, and you fork a hundred processes to handle requests, your memory use will be "C * n * p" where C is a constant, n is the number of processes, and p is the average memory consumption of your app. I fear C is very nearly 1.0. Most of Python's memory usage is on the heap, Python uses its memory to store objects, objects are reference counted, and reference counts change. So all the COW data pages get written to sooner or later.

The obvious first step: add a magical reference count number that never changes, called Py_REF_ETERNAL. I added this to CPython trunk with a quick hack. It seemed to work; I threw in some asserts to test it, which passed, and it was passing the unit test suite.

I discussed this with Martin, who (as usual) made some excellent points. Martin suggests that this wouldn't help unless we could concentrate the Py_REF_ETERNAL objects in their own memory pools in the small-block allocator; otherwise we'd never get a page that didn't get written to sooner or later.

Obviously all interned strings could get Py_REF_ETERNAL. A slightly more controversial idea: mark code objects (but only those that get unmarshaled, says Martin!) as Py_REF_ETERNAL too. Yes, you can unload code from sys.modules, but in practice if you ever import something, you never throw it away for the life of the process. If we went this route we could probably mark most (all?) immutable objects that get unmarshaled with Py_REF_ETERNAL.

Martin's statistics from writing the flexible string representation say that for a toy Django app, memory consumption is mostly strings, and most strings are short (< 16 or even < 8 bytes)... in other words, identifiers.
So if you ran 100 toy Django instances, it seems likely this would help!

And no, I haven't benchmarked it,

/arry
On Mon, Mar 12, 2012 at 5:10 PM, Larry Hastings <larry@hastings.org> wrote:
I'll admit in advance that this is in all likelihood a terrible idea. What I'm curious about is why it wouldn't work, or if it wouldn't even help ;-)
One problem for CPython is that it can't share data across processes very often. If you have an application server, and you fork a hundred processes to handle requests, your memory use will be "C * n * p" where C is a constant, n is the number of processes, and p is the average memory consumption of your app. I fear C is very nearly 1.0. Most of Python's memory usage is on the heap, Python uses its memory to store objects, objects are reference counted, and reference counts change. So all the COW data pages get written to sooner or later.
Despite really disliking anything that fork()s these days, and generally not using fork anymore, I have been pondering this one on and off over the years as well. It could help people using the fork()ing variant of multiprocessing (i.e., its default today).

If reference counts were moved out of the PyObject structure into a region of memory allocated specifically for reference counts, only those pages would need copying, rather than virtually every random page of memory containing a PyObject. My initial thought was to do this by turning the existing refcount field into a pointer to the object's count, or into an array reference that code managing the reference-count array would use to manipulate the count. Obviously either of these would have some performance impact and break the ABI.

Some practical, real-world-ish forking-server and multiprocessing-computation memory-usage benchmarks need to be put together to measure the impact of any work on that.
The obvious first step: add a magical reference count number that never changes, called Py_REF_ETERNAL. I added this to CPython trunk with a quick hack. It seemed to work; I threw in some asserts to test it, which passed, and it was passing the unit test suite.
I discussed this with Martin who (as usual) made some excellent points. Martin suggests that this wouldn't help unless we could concentrate the Py_REF_ETERNAL objects in their own memory pools in the small block allocator. Otherwise we'd never get a page that didn't get written to sooner or later.
Obviously all interned strings could get Py_REF_ETERNAL. A slightly more controversial idea: mark code objects (but only those that get unmarshaled, says Martin!) as Py_REF_ETERNAL too. Yes, you can unload code from sys.modules, but in practice if you ever import something you never throw it away for the life of the process. If we went this route we could probably mark most (all?) immutable objects that get unmarshaled with Py_REF_ETERNAL.
You have this working? Neat. I toyed with making a magic value (-1 or -2 or something) mean "infinite" or "eternal" for ref counts a few years ago, but things were crashing and I really didn't feel like trying to debug that one. It makes sense for any intern()'ed string to be set to eternal. If you know at allocation time which objects will be eternal, clustering them into different pages makes a lot of sense, but I don't believe we express that meaningfully in our code today. -gps
Martin's statistics from writing the flexible string representation says that for a toy Django app, memory consumption is mostly strings, and most strings are short (< 16 or even < 8 bytes)... in other words, identifiers. So if you ran 100 toy Django instances it seems likely this would help!
And no, I haven't benchmarked it,
/arry

_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
http://mail.python.org/mailman/listinfo/python-ideas
On 13.03.2012 05:49, Gregory P. Smith wrote:
Despite really disliking anything that fork()s these days, and generally not using fork anymore, I have been pondering this one on and off over the years as well. It could help people using the fork()ing variant of multiprocessing (i.e., its default today).
If reference counts were moved out of the PyObject structure into a region of memory allocated specifically for reference counts, only those pages would need copying, rather than virtually every random page of memory containing a PyObject. My initial thought was to do this by turning the existing refcount field into a pointer to the object's count, or into an array reference that code managing the reference-count array would use to manipulate the count. Obviously either of these would have some performance impact and break the ABI.
Some practical, real-world-ish forking-server and multiprocessing-computation memory-usage benchmarks need to be put together to measure the impact of any work on that.
This looks like the Dalvik VM in Android, which does many things to preserve memory when forking.

HTH,
Niki
On Mar 12, 2012, at 08:49 PM, Gregory P. Smith wrote:
If reference counts were moved out of the PyObject structure into a region of memory allocated specifically for reference counts, only those pages would need copying, rather than virtually every random page of memory containing a PyObject. My initial thought was to do this by turning the existing refcount field into a pointer to the object's count, or into an array reference that code managing the reference-count array would use to manipulate the count. Obviously either of these would have some performance impact and break the ABI.
It's been *ages* since I really knew how any of this worked, but I think some flavor of the Objective-C runtime did reference counting this way. I think this afforded them other tricks, like the ability to not increment the refcount for an object if it was exactly 1. I've no doubt someone here will fill in all my faulty memory and gaps, but I do seem to recall it being a pretty efficient system for memory management. Cheers, -Barry
On Fri, Mar 23, 2012 at 10:40 AM, Barry Warsaw <barry@python.org> wrote:
It's been *ages* since I really knew how any of this worked, but I think some flavor of the Objective-C runtime did reference counting this way. I think this afforded them other tricks, like the ability to not increment the refcount for an object if it was exactly 1. I've no doubt someone here will fill in all my faulty memory and gaps, but I do seem to recall it being a pretty efficient system for memory management.
Also from the world of "hazy memories of old discussions", my recollection is that the two main problems with the indirection are:

- an extra pointer indirection for every refcounting operation (which are frequent enough that the micro-pessimisation has a measurable effect)
- some loss of cache locality (since every Python object will need both its own memory and its refcount memory in the cache)

Larry's suggestion for allowing eternal objects avoids the latter problem, but still suffers from (a variant of) the first. As the many GIL discussions can attest, we're generally very reluctant to accept a single-threaded (or, in this case, single-process) performance hit to improve behaviour in the concurrent case.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Mon, 12 Mar 2012 17:10:28 -0700 Larry Hastings <larry@hastings.org> wrote:
Martin's statistics from writing the flexible string representation say that for a toy Django app, memory consumption is mostly strings, and most strings are short (< 16 or even < 8 bytes)... in other words, identifiers. So if you ran 100 toy Django instances, it seems likely this would help!
How many MB do you save on a real app, though? By the way, "short strings are identifiers" is a fallacy.
And no, I haven't benchmarked it,
Well, you should.

Regards,
Antoine.
On Mon, Mar 12, 2012 at 8:10 PM, Larry Hastings <larry@hastings.org> wrote:
The obvious first step: add a magical reference count number that never changes, called Py_REF_ETERNAL.
If you have a magic number, you need to check before doing the update; at some point in the distant past, that was considered too expensive because it is done so often. But once you do pay the cost of a more expensive refcount update, this isn't the only optimization available. For example, the incref/decref can be delayed or batched up, which can help with remote objects or incremental garbage collection. Gating reference acquisition may also be re-purposed to serve as thread-locking, or to more efficiently support Software Transactional Memory.
Martin suggests that this wouldn't help unless we could concentrate the Py_REF_ETERNAL objects in their own memory pools in the small block allocator.
Right; it makes sense to have the incref/decref function be per-arena, or at least per page or some such. -jJ
Jim Jewett, 14.03.2012 23:07:
On Mon, Mar 12, 2012 at 8:10 PM, Larry Hastings wrote:
The obvious first step: add a magical reference count number that never changes, called Py_REF_ETERNAL.
If you have a magic number, you need to check before doing the update; at some point in the distant past, that was considered too expensive because it is done so often.
Well, we could switch to a floating-point value for the refcount and let the CPU do it for us, using +inf as the magic value. (This is python-ideas, right?)

Stefan
For those who have trouble understanding what all this memory-page stuff is about, here is a good intro: "Python, Linkers, and Virtual Memory" by Brandon Rhodes, http://www.youtube.com/watch?v=twQKAoq2OPE

--
anatoly t.
participants (9)

- anatoly techtonik
- Antoine Pitrou
- Barry Warsaw
- Gregory P. Smith
- Jim Jewett
- Larry Hastings
- Nick Coghlan
- Niki Spahiev
- Stefan Behnel