<div><br><div><br><div class="gmail_quote">On Mon, Mar 12, 2012 at 5:10 PM, Larry Hastings <span dir="ltr">&lt;<a href="mailto:larry@hastings.org">larry@hastings.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<br>

I'll admit in advance that this is in all likelihood a terrible idea.  What I'm curious about is why it wouldn't work, or if it wouldn't even help ;-)<br>

<br>

One problem for CPython is that it can't share data across processes very often.  If you have an application server, and you fork a hundred processes to handle requests, your memory use will be "C * n * p" where C is a constant, n is the number of processes, and p is the average memory consumption of your app.  I fear C is very nearly 1.0.  Most of Python's memory usage is on the heap, and Python uses its memory to store objects, and objects are reference counted, and reference counts change.  So all the COW data pages get written to sooner or later.<br>


</blockquote><div><br></div><div>Despite really disliking anything that fork()s these days and generally not using fork anymore, I have been pondering this one on and off over the years as well. It could help people using the fork()ing variant of multiprocessing (i.e., its default today).<div>


<br></div><div>If reference counts were moved out of the PyObject structure into a region of memory allocated specifically for reference counts, only those pages would need copying rather than virtually every random page of memory containing a PyObject.  My initial thought was to do this by turning the existing refcount field into either a pointer to the object's count or an index into a reference count array, which the code managing that array would use to manipulate the count.  Obviously either of these would have some performance impact and break the ABI.</div>
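<div><br></div><div>A rough way to see why a dedicated refcount region could help is a minimal Python model (not CPython code). The page size, object size, counter size, and object count are all assumptions picked for illustration, and pages_dirtied is a hypothetical helper. It counts how many pages get written when every object is increfed once, first with the count stored inside each object and then with the counts packed into one dense array.</div>
<pre>
```python
PAGE_SIZE = 4096    # assumed page size in bytes
OBJECT_SIZE = 64    # assumed size of each object in bytes
COUNT_SIZE = 8      # assumed size of one reference count in bytes
N = 1000            # assumed number of live objects


def pages_dirtied(addresses, page_size=PAGE_SIZE):
    """Count the distinct pages that contain any of the written addresses."""
    return len({addr // page_size for addr in addresses})


# Inline counts: an incref writes into the header of the object itself,
# so the write lands wherever the object happens to live.
inline_writes = [i * OBJECT_SIZE for i in range(N)]

# Side table: an incref writes into a dense array of counters instead,
# so all writes are concentrated in a small region.
table_writes = [i * COUNT_SIZE for i in range(N)]

inline_pages = pages_dirtied(inline_writes)
table_pages = pages_dirtied(table_writes)

print("inline refcounts dirty", inline_pages, "pages")
print("side-table refcounts dirty", table_pages, "pages")
```
</pre>
<div>Under these assumptions the inline layout writes to all 16 pages holding objects, while the side table writes to only 2. The model says nothing about the cost of the extra indirection, which would still need measuring.</div>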


<div><br></div><div>Some practical real-world-ish forking server and multiprocessing computation memory usage benchmarks need to be put together to measure the impact of any work on that.</div><div><br></div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<br>

The obvious first step: add a magical reference count number that never changes, called Py_REF_ETERNAL.  I added this to CPython trunk with a quick hack.  It seemed to work; I threw in some asserts to test it, which passed, and it was passing the unit test suite.<br>


<br>

I discussed this with Martin who (as usual) made some excellent points.  Martin suggests that this wouldn't help unless we could concentrate the Py_REF_ETERNAL objects in their own memory pools in the small block allocator.  Otherwise we'd never get a page that didn't get written to sooner or later.<br>


<br>

Obviously all interned strings could get Py_REF_ETERNAL.  A slightly more controversial idea: mark code objects (but only those that get unmarshaled, says Martin!) as Py_REF_ETERNAL too.  Yes, you can unload code from sys.modules, but in practice if you ever import something you never throw it away for the life of the process.  If we went this route we could probably mark most (all?) immutable objects that get unmarshaled with Py_REF_ETERNAL.<br>


</blockquote><div><br></div><div>You have this working?  Neat.  I toyed with making a magic value (-1 or -2 or something) mean "infinite" or "eternal" for ref counts a few years ago, but things were crashing and I really didn't feel like trying to debug that one.  It makes sense for any intern()'ed string to be set to eternal.</div>
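<div><br></div><div>For illustration only, the magic-value idea can be modeled in a few lines of Python. This is a sketch, not the actual patch: REF_ETERNAL = -1 is an assumed sentinel (the real value of Py_REF_ETERNAL is not given in this thread), and Header, incref, and decref are hypothetical stand-ins for the object header and the Py_INCREF / Py_DECREF macros. The writes field only exists to show that an eternal object's header is never modified.</div>
<pre>
```python
REF_ETERNAL = -1  # assumed sentinel value, chosen only for this sketch


class Header:
    """Hypothetical stand-in for an object header holding a refcount."""

    def __init__(self, refcnt=1):
        self.refcnt = refcnt
        self.writes = 0  # how many times the count was modified


def incref(header):
    """Add a reference, skipping the write entirely for eternal objects."""
    if header.refcnt == REF_ETERNAL:
        return
    header.refcnt += 1
    header.writes += 1


def decref(header) -> bool:
    """Drop a reference; return True when the object should be freed."""
    if header.refcnt == REF_ETERNAL:
        return False  # eternal objects are never freed
    header.refcnt -= 1
    header.writes += 1
    return header.refcnt == 0


interned = Header(REF_ETERNAL)
incref(interned)
decref(interned)
print(interned.refcnt, interned.writes)

ordinary = Header()
incref(ordinary)
decref(ordinary)
print(ordinary.refcnt, ordinary.writes)
```
</pre>
<div>The extra comparison on every incref and decref is the price of the scheme, so whether it pays off is a question for benchmarks.</div>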


<div><br></div><div>If you know at allocation time which objects will be eternal, clustering them into separate pages makes a lot of sense, but I don't believe we express that meaningfully in our code today.</div>
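<div><br></div><div>To make the clustering argument concrete, here is a small Python model under stated assumptions: 4096-byte pages, 64-byte objects, 1000 objects of which every other one is eternal, and only the non-eternal ones ever have their counts written. None of these numbers come from real measurements, and page_of and round_up_to_page are hypothetical helpers. It compares how many pages are dirtied when eternal and ordinary objects are interleaved versus when each kind gets its own page-aligned pool.</div>
<pre>
```python
PAGE_SIZE = 4096   # assumed page size in bytes
OBJECT_SIZE = 64   # assumed fixed object size in bytes
N = 1000           # assumed object count; every other object is eternal


def page_of(addr):
    """Return the index of the page containing the given address."""
    return addr // PAGE_SIZE


def round_up_to_page(nbytes):
    """Round a byte count up to the next multiple of the page size."""
    return -(-nbytes // PAGE_SIZE) * PAGE_SIZE


# Interleaved layout: odd-numbered slots hold ordinary objects, and only
# those are written, yet they are spread across every page.
interleaved_dirty = {
    page_of(i * OBJECT_SIZE) for i in range(N) if i % 2 == 1
}

# Segregated layout: eternal objects fill their own pool first, and the
# ordinary objects start at the next page boundary.
n_eternal = N // 2
ordinary_base = round_up_to_page(n_eternal * OBJECT_SIZE)
segregated_dirty = {
    page_of(ordinary_base + j * OBJECT_SIZE) for j in range(N - n_eternal)
}

print("interleaved:", len(interleaved_dirty), "pages dirtied")
print("segregated:", len(segregated_dirty), "pages dirtied")
```
</pre>
<div>In this toy layout, interleaving dirties all 16 pages, while segregation dirties only the 8 pages holding ordinary objects and leaves the 8 pages of eternal objects shareable.</div>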


<div><br></div><div>-gps</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Martin's statistics from writing the flexible string representation say that for a toy Django app, memory consumption is mostly strings, and most strings are short (< 16 or even < 8 bytes)... in other words, identifiers.  So if you ran 100 toy Django instances it seems likely this would help!<br>


<br>

And no, I haven't benchmarked it.<br>

<br>

<br>

/arry<br>

_______________________________________________<br>

Python-ideas mailing list<br>

<a href="mailto:Python-ideas@python.org" target="_blank">Python-ideas@python.org</a><br>

<a href="http://mail.python.org/mailman/listinfo/python-ideas" target="_blank">http://mail.python.org/mailman/listinfo/python-ideas</a><br>

</blockquote></div><br></div></div>