[Python-Dev] CPython optimization: storing reference counters outside of objects

Sun May 22 01:57:55 CEST 2011

Hi.
The problem with reference counters is that they are incremented and
decremented very often, even by read-only algorithms (like traversal
of a list). This has two drawbacks:
1. CPU cache lines (64 bytes on x86) containing the beginning of a
PyObject are invalidated very often, so many chances to use the CPU
caches are lost.
2. The copy-on-write optimization after fork() (Linux) is almost
useless in CPython: even if you don't modify data directly,
refcounts are modified, and PyObjects with refcounts inside are spread
all over the process's memory (and one small refcount modification
causes the whole page, 4 kB, to be copied in the child process).
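
The refcount traffic caused by a plain read can be observed from Python
itself. This is only a minimal sketch using sys.getrefcount; the
function name refcount_delta_on_read is made up for illustration, and
since absolute counts vary between interpreter versions, only the
difference is looked at:

```python
import sys

def refcount_delta_on_read():
    """Return how much an object's refcount changes when it is merely read."""
    items = [object() for _ in range(3)]
    before = sys.getrefcount(items[0])
    first = items[0]          # a plain read: binds one more reference
    after = sys.getrefcount(items[0])
    del first                 # releasing the name writes to the counter again
    return after - before
```

Nothing in the list is modified here, yet the counter stored in the
object's header is written on the read and again on the release.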

So an idea I would like to try is to move reference counts out of
PyObjects, into one or more contiguous blocks of memory. Each PyObject
would hold a pointer to its reference count inside such a block. With
this, I think that:
1. The beginning of PyObject structs could stay CPU-cached for a much
longer time (small objects like ints could be fully cached). I don't
know whether having the writes localized in the block with refcounts
would also help performance.
2. Copy-on-write after fork() would work much better: only the block
with refcounts would be copied in the child process (for read-only
algorithms).
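
A back-of-envelope sketch of point 2, under assumptions that are mine
and not measured: objects are packed back to back and are no larger
than a page, a read-only pass touches the refcount of every object, and
each external counter takes 8 bytes. The function names are made up for
illustration:

```python
PAGE_SIZE = 4096       # bytes, the page size mentioned above
COUNTER_SIZE = 8       # assumption: one 8-byte counter per object

def ceil_div(a, b):
    """Integer division rounded up."""
    return (a + b - 1) // b

def pages_dirtied_inline(n_objects, object_size):
    """Refcount in the header: every page holding objects gets written."""
    return ceil_div(n_objects * object_size, PAGE_SIZE)

def pages_dirtied_external(n_objects):
    """Refcounts in one block: only the pages of that block get written."""
    return ceil_div(n_objects * COUNTER_SIZE, PAGE_SIZE)
```

For 1,000,000 objects of 64 bytes this gives 15625 pages copied with
inline refcounts versus 1954 pages with an external block. The estimate
ignores the extra pointer each object would need.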

However, the drawback is that such a design introduces a new level of
indirection: a pointer inside a PyObject instead of a direct
value. Also, it seems that the "block" with refcounts would have to be
a non-trivial data structure.
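
To make the "non-trivial data structure" concrete, here is a toy model
in Python rather than C. Everything in it is hypothetical: the class
name RefcountTable, the use of a slot index in place of a real pointer,
and the free list used to recycle slots of dead objects. It is only
meant to show the extra bookkeeping, not how CPython would do it:

```python
from array import array

class RefcountTable:
    """Toy side table: all counters live in one contiguous array."""

    def __init__(self):
        self.counts = array("q")   # contiguous signed 64-bit counters
        self.free_slots = []       # indices of released slots, for reuse

    def allocate(self):
        """Reserve a slot with an initial count of 1 and return its index."""
        if self.free_slots:
            slot = self.free_slots.pop()
            self.counts[slot] = 1
        else:
            slot = len(self.counts)
            self.counts.append(1)
        return slot

    def incref(self, slot):
        """One extra lookup compared to a counter stored in the object."""
        self.counts[slot] += 1

    def decref(self, slot):
        """Drop one reference; return True when the object should be freed."""
        self.counts[slot] -= 1
        if self.counts[slot] == 0:
            self.free_slots.append(slot)   # slot can serve a new object
            return True
        return False
```

Each object would store the value returned by allocate() and pass it to
incref()/decref(). Without the free list the block would only grow, and
with it the counters of objects created together no longer stay next to
each other.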

I'm not a compiler/profiling expert, so my main question is whether such
a design can work. Has anyone thought about something similar? And has
CPython ever been profiled for CPU cache usage?