[Python-ideas] Save memory when forking with *really* immutable objects
Larry Hastings
larry at hastings.org
Tue Mar 13 01:10:28 CET 2012
I'll admit in advance that this is in all likelihood a terrible idea.
What I'm curious about is why it wouldn't work, or if it wouldn't even
help ;-)
One problem for CPython is that it can't share data across processes
very often. If you have an application server, and you fork a hundred
processes to handle requests, your memory use will be "C * n * p" where
C is a constant, n is the number of processes, and p is the average
memory consumption of your app. I fear C is very nearly 1.0. Most
of Python's memory usage is on the heap, and Python uses its memory to
store objects, and objects are reference counted, and reference counts
change. So all the copy-on-write (COW) data pages get written to sooner or later.
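
To make the argument concrete, here is a minimal sketch, assuming CPython (where sys.getrefcount exists). It shows that merely taking another reference to an object changes its reference count, and that count is stored in the object's own header, on the same heap page as the object. The helper name estimated_memory is hypothetical and only spells out the "C * n * p" estimate; the sample numbers are made up for illustration.

```python
import sys

def estimated_memory(c, n, p):
    """Total memory for n forked processes, each using p on average,
    where c is the fraction of that memory that ends up unshared."""
    return c * n * p

def refcount_delta():
    """Return how much an object's refcount changes when one new
    reference to it is created (CPython-specific)."""
    obj = object()                   # a fresh, ordinary heap object
    before = sys.getrefcount(obj)
    holder = [obj]                   # one more reference; no data is modified
    after = sys.getrefcount(obj)
    del holder
    return after - before

# Made-up numbers: 100 workers, 50 MB each, nothing shared (C = 1.0).
total_mb = estimated_memory(1.0, 100, 50)
delta = refcount_delta()
```

The point of refcount_delta is that "reading" an object in Python is still a write at the C level, which is what defeats page sharing after fork.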
The obvious first step: add a magical reference count number that never
changes, called Py_REF_ETERNAL. I added this to CPython trunk with a
quick hack. It seemed to work; I threw in some asserts to test it,
which passed, and it was passing the unit test suite.
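
The actual hack is not shown in this message, so the following is only a toy model in Python of what such a sentinel might do, not the real patch. The value chosen for PY_REF_ETERNAL, the ToyObject class, and the incref/decref helpers are all assumptions made for illustration; in CPython the equivalent logic would live in the C reference-counting macros.

```python
PY_REF_ETERNAL = 2**31 - 1   # hypothetical sentinel; the real value is not given in the post

class ToyObject:
    """Stand-in for a C object header: a refcount plus a write counter."""
    def __init__(self, refcnt=1):
        self.refcnt = refcnt
        self.header_writes = 0   # how many times the header was modified

def incref(obj):
    if obj.refcnt == PY_REF_ETERNAL:
        return                   # eternal: never touch the header
    obj.refcnt += 1
    obj.header_writes += 1

def decref(obj):
    if obj.refcnt == PY_REF_ETERNAL:
        return                   # eternal: never touch, never deallocate
    obj.refcnt -= 1
    obj.header_writes += 1

normal = ToyObject()
eternal = ToyObject(refcnt=PY_REF_ETERNAL)
for o in (normal, eternal):
    incref(o)
    decref(o)
```

After the loop the normal object's header has been written twice and the eternal one not at all, which is the property that would keep its page clean.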
I discussed this with Martin who (as usual) made some excellent points.
Martin suggests that this wouldn't help unless we could concentrate the
Py_REF_ETERNAL objects in their own memory pools in the small block
allocator. Otherwise we'd never get a page that didn't get written to
sooner or later.
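
Martin's point can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of the real small block allocator: OBJECTS_PER_PAGE, the 3:1 ratio of eternal to ordinary objects, and the dirty_pages helper are all invented for the example. A page counts as dirty if it holds at least one object whose reference count can still change.

```python
OBJECTS_PER_PAGE = 64   # assumed; real pool and page sizes differ

def dirty_pages(layout):
    """layout[i] is True if the object in slot i is eternal.
    Returns the number of pages containing at least one non-eternal
    object, i.e. pages that will eventually be written to."""
    return len({i // OBJECTS_PER_PAGE
                for i, eternal in enumerate(layout) if not eternal})

# 1024 objects, three quarters of them eternal (made-up proportion).
interleaved = [i % 4 != 0 for i in range(1024)]
segregated = sorted(interleaved, reverse=True)   # all eternal objects first

mixed_dirty = dirty_pages(interleaved)
pooled_dirty = dirty_pages(segregated)
```

With the objects interleaved, every page contains at least one ordinary object and so all 16 pages get dirtied; with the eternal objects grouped together, only the 4 pages of ordinary objects do.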
Obviously all interned strings could get Py_REF_ETERNAL. A slightly
more controversial idea: mark code objects (but only those that get
unmarshaled, says Martin!) as Py_REF_ETERNAL too. Yes, you can unload
code from sys.modules, but in practice if you ever import something you
never throw it away for the life of the process. If we went this route
we could probably mark most (all?) immutable objects that get
unmarshaled with Py_REF_ETERNAL.
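
As a rough illustration of which objects this would cover, the sketch below walks a code object that has been round-tripped through marshal and collects the nested code objects and immutable constants reachable from it. The function name eternal_candidates is hypothetical, and the walk covers only a few code object attributes; it shows the idea and makes no claim about what a real patch would mark.

```python
import marshal
import types

def eternal_candidates(code):
    """Collect the code objects and immutable constants reachable from
    a code object, e.g. one just loaded by marshal."""
    found = []
    stack = [code]
    while stack:
        obj = stack.pop()
        found.append(obj)
        if isinstance(obj, types.CodeType):
            stack.extend(obj.co_consts)     # constants, including nested code
            stack.extend(obj.co_names)      # global and attribute names
            stack.extend(obj.co_varnames)   # local variable names
        elif isinstance(obj, (tuple, frozenset)):
            stack.extend(obj)               # immutable containers of constants
    return found

source = "def greet(name):\n    return 'hello ' + name\n"
compiled = compile(source, "<example>", "exec")
# Round-trip through marshal, roughly what loading a .pyc file does.
loaded = marshal.loads(marshal.dumps(compiled))
found = eternal_candidates(loaded)
```

Everything in found is immutable and, in the common case where the module is never unloaded, lives for the life of the process.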
Martin's statistics from writing the flexible string representation say
that for a toy Django app, memory consumption is mostly strings, and
most strings are short (< 16 or even < 8 bytes)... in other words,
identifiers. So if you ran 100 toy Django instances it seems likely
this would help!
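
For anyone who wants to check the "most strings are short" claim on their own app, here is a minimal sketch of the measurement. It is not Martin's methodology: the helper short_fraction is hypothetical, it counts characters rather than bytes (the same thing for ASCII identifiers), and the sample used here is just the names in the builtins module.

```python
import builtins

def short_fraction(strings, limit=16):
    """Fraction of the given strings whose length is below limit."""
    strings = list(strings)
    if not strings:
        return 0.0
    return sum(len(s) < limit for s in strings) / len(strings)

# Sample: the names defined in the builtins module, which are identifiers.
builtin_names = dir(builtins)
fraction_short = short_fraction(builtin_names)
```

Feeding it the strings actually alive in a running process would give a more meaningful number than this sample does.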
And no, I haven't benchmarked it,
/arry