[Python-ideas] Save memory when forking with *really* immutable objects

Larry Hastings larry at hastings.org
Tue Mar 13 01:10:33 CET 2012


I'll admit in advance that this is in all likelihood a terrible idea.  
What I'm curious about is why it wouldn't work, or if it wouldn't even 
help ;-)

One problem for CPython is that it can't share data across processes 
very often.  If you have an application server, and you fork a hundred 
processes to handle requests, your memory use will be "C * n * p" where 
C is a constant, n is the number of processes, and p is the average 
memory consumption of your app.  I fear C is very nearly 1.0.  Most 
of Python's memory usage is on the heap, and Python uses its memory to 
store objects, and objects are reference counted, and reference counts 
change.  So all the COW data pages get written to sooner or later.
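
To make that reasoning concrete, here is a minimal sketch (not a 
benchmark).  total_memory() just evaluates the "C * n * p" estimate, and 
refcount_delta() uses sys.getrefcount(), which is CPython-specific, to 
show that merely storing one more reference to an object changes its 
reference count.  That write is what dirties a COW page after a fork.  
Both function names are hypothetical, and any numbers you feed 
total_memory() are your own assumptions.

```python
import sys


def total_memory(c, n, p):
    """Evaluate the estimate from the text: C * n * p."""
    return c * n * p


def refcount_delta():
    """Return how much an object's refcount changes when one more
    reference to it is stored (CPython-specific)."""
    obj = object()
    before = sys.getrefcount(obj)
    holder = [obj]  # the list now holds one extra reference to obj
    after = sys.getrefcount(obj)
    del holder
    return after - before
```

With C at 1.0, nothing is shared: total_memory(1.0, 100, p) is simply a 
hundred copies of p.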

The obvious first step: add a magical reference count number that never 
changes, called Py_REF_ETERNAL.  I added this to CPython trunk with a 
quick hack.  It seemed to work; I threw in some asserts to test it, 
which passed, and it was passing the unit test suite.
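
The idea can be modelled in a few lines of Python.  This is a toy 
sketch, not the actual hack: the sentinel value -1, the ToyObject class 
and the incref()/decref() helpers are all assumptions made for 
illustration.  The point is only that an object carrying the magic 
count never has its refcount field written.

```python
PY_REF_ETERNAL = -1  # hypothetical sentinel; the real value is not stated


class ToyObject:
    """Stand-in for an object header with a reference count."""

    def __init__(self, refcnt=1):
        self.refcnt = refcnt
        self.writes = 0  # how many times the refcount field was written


def incref(obj):
    if obj.refcnt == PY_REF_ETERNAL:
        return  # eternal: leave the field untouched
    obj.refcnt += 1
    obj.writes += 1


def decref(obj):
    if obj.refcnt == PY_REF_ETERNAL:
        return  # eternal: never decremented, so never freed
    obj.refcnt -= 1
    obj.writes += 1
```

The cost of this approach is an extra comparison on every incref and 
decref.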

I discussed this with Martin who (as usual) made some excellent points.  
Martin suggests that this wouldn't help unless we could concentrate the 
Py_REF_ETERNAL objects in their own memory pools in the small block 
allocator.  Otherwise we'd never get a page that didn't get written to 
sooner or later.
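
Martin's point can be illustrated with a small simulation.  It is a 
sketch under loud assumptions: 4096-byte pages, every object exactly 64 
bytes, and every non-eternal object having its refcount touched sooner 
or later.  A layout is a list of booleans, True meaning "this slot 
holds an eternal object"; count_dirty_pages() is a hypothetical helper.

```python
PAGE_SIZE = 4096   # assumed page size in bytes
OBJECT_SIZE = 64   # assumed: every object is the same size


def count_dirty_pages(layout):
    """Return (dirty_pages, total_pages) for a layout of slots.

    A page counts as dirty if it holds at least one non-eternal
    object, since that object's refcount will eventually be written.
    """
    per_page = PAGE_SIZE // OBJECT_SIZE
    pages = [layout[i:i + per_page] for i in range(0, len(layout), per_page)]
    dirty = sum(1 for page in pages if not all(page))
    return dirty, len(pages)


# 640 eternal and 640 normal objects, interleaved in one pool...
mixed = [i % 2 == 0 for i in range(1280)]
# ...versus eternal objects concentrated in their own pools.
segregated = [True] * 640 + [False] * 640
```

Under these assumptions the mixed layout dirties every page, while the 
segregated layout leaves the pages holding only eternal objects clean.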

Obviously all interned strings could get Py_REF_ETERNAL.  A slightly 
more controversial idea: mark code objects (but only those that get 
unmarshaled, says Martin!) as Py_REF_ETERNAL too.  Yes, you can unload 
code from sys.modules, but in practice if you ever import something you 
never throw it away for the life of the process.  If we went this route 
we could probably mark most (all?) immutable objects that get 
unmarshaled with Py_REF_ETERNAL.
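
To get a feel for which objects that would cover, here is a rough 
sketch that round-trips a code object through marshal and walks what 
comes back.  eternal_candidates() and IMMUTABLE_TYPES are hypothetical 
names, and the choice of types is an assumption, not a proposal for the 
exact set.

```python
import marshal

# Assumed set of immutable types worth considering.
IMMUTABLE_TYPES = (str, bytes, int, float, complex, tuple, frozenset,
                   type(None))


def eternal_candidates(code):
    """Yield a code object and the immutable objects reachable from it."""
    yield code
    # Identifiers used by the code: global/attribute names and locals.
    for name in code.co_names + code.co_varnames:
        yield name
    for const in code.co_consts:
        if isinstance(const, type(code)):
            # Nested function or class body: recurse into it.
            yield from eternal_candidates(const)
        elif isinstance(const, IMMUTABLE_TYPES):
            yield const


source = "def greet(name):\n    return 'hello ' + name\n"
compiled = compile(source, "<example>", "exec")
code = marshal.loads(marshal.dumps(compiled))  # as if loaded from a .pyc
found = list(eternal_candidates(code))
```

Even for this two-line module, found contains two code objects plus 
the identifiers and string constants they refer to.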

Martin's statistics from writing the flexible string representation say 
that for a toy Django app, memory consumption is mostly strings, and 
most strings are short (< 16 or even < 8 bytes)... in other words, 
identifiers.  So if you ran 100 toy Django instances it seems likely 
this would help!

And no, I haven't benchmarked it.


/arry


