CPython optimization: storing reference counters outside of objects
Hi. The problem with reference counters is that they are very often incremented/decremented, even by read-only algorithms (like traversal of a list). This has two drawbacks:

1. CPU cache lines (64 bytes on x86) containing the beginning of a PyObject are very often invalidated, resulting in losing many chances to use the CPU caches.

2. The copy-on-write optimization after fork() (on Linux) is almost useless in CPython, because even if you don't modify data directly, the refcounts are modified, and PyObjects with refcounts inside are spread all over the process's memory - so one small refcount modification causes a whole 4 kB page to be copied into the child process.

So an idea I would like to try is to move reference counts outside of PyObjects, into a contiguous block (or blocks) of memory. Each PyObject would hold a pointer to its reference count inside this block. Doing this, I think that:

1. The beginning of a PyObject struct could stay in the CPU cache for a much longer time (small objects like ints could be fully cached). I don't know whether having localized writes into the refcount block would also help performance?

2. Copy-on-write after fork() would work much better - only the block with refcounts would be copied into the child process (for read-only algorithms).

However, the drawback is that such a design introduces a new level of indirection: a pointer inside each PyObject instead of a direct value. It also seems that the "block" with refcounts would have to be a non-trivial data structure.

I'm not a compiler/profiling expert, so the main question is whether such a design can work, and maybe someone has already thought about something similar? And has CPython ever been profiled for CPU cache usage?
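To make the point about read-only code concrete, here is a small illustrative sketch (added for clarity, not part of the original post): merely iterating over a list binds and unbinds each element, so ob_refcnt is written even though nothing is mutated.

```python
import sys

big = object()           # any ordinary heap object
data = [big] * 1000      # a list that merely refers to it 1000 times

before = sys.getrefcount(big)
for item in data:        # binding `item` INCREFs the object...
    pass                 # ...and the next iteration's rebinding DECREFs it
after = sys.getrefcount(big)

# The count ends where it started, but ob_refcnt was written ~2000 times
# along the way, enough to dirty the memory page holding the object.
assert before == after
```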
Hello, On Sun, 22 May 2011 01:57:55 +0200 Artur Siekielski <artur.siekielski@gmail.com> wrote:
1. CPU cache lines (64 bytes on x86) containing the beginning of a PyObject are very often invalidated, resulting in losing many chances to use the CPU caches.
Mutating data doesn't invalidate a cache line. It just makes it necessary to write it back to memory at some point.
2. The copy-on-write optimization after fork() (on Linux) is almost useless in CPython, because even if you don't modify data directly, the refcounts are modified, and PyObjects with refcounts inside are spread all over the process's memory - so one small refcount modification causes a whole 4 kB page to be copied into the child process.
Indeed.
I'm not a compiler/profiling expert, so the main question is whether such a design can work, and maybe someone has already thought about something similar? And has CPython ever been profiled for CPU cache usage?
This has already been proposed a couple of times. I guess what's needed is for someone to experiment and post benchmark results. Regards Antoine.
1. CPU cache lines (64 bytes on x86) containing the beginning of a PyObject are very often invalidated, resulting in losing many chances to use the CPU caches.
Mutating data doesn't invalidate a cache line. It just makes it necessary to write it back to memory at some point.
I think he's referring to the multi-core case. In MESI terminology, the cache line becomes Modified in the current core's cache (current thread) but Invalid in the other cores' caches. But given that object accesses are serialized by the GIL (which implies a memory barrier anyway), I'm not sure the performance impact would be noticeable. Furthermore, since threads are effectively serialized, I suspect the scheduler tends to keep them naturally bound to the same CPU.
2. The copy-on-write optimization after fork() (on Linux) is almost useless in CPython, because even if you don't modify data directly, the refcounts are modified, and PyObjects with refcounts inside are spread all over the process's memory - so one small refcount modification causes a whole 4 kB page to be copied into the child process.
Indeed.
There was a bug report a couple of months ago from someone using large datasets in a scientific application. He suggested adding support for Linux's MADV_MERGEABLE, but the root cause is really the reference count being incremented even when objects are treated as read-only. For the record, it's http://bugs.python.org/issue9942 (and this idea was brought up here). cf
I'm not a compiler/profiling expert, so the main question is whether such a design can work, and maybe someone has already thought about something similar?
My expectation is that your approach would likely make the issues worse in a multi-CPU setting. If you put multiple reference counters into a contiguous block of memory, unrelated reference counters will live in the same cache line. Consequently, changing one reference counter on one CPU will invalidate the cached reference counters in that cache line on the other CPUs, actually making your problem 1) worse. Regards, Martin
2011/5/23 "Martin v. Löwis" <martin@v.loewis.de>
I'm not a compiler/profiling expert, so the main question is whether such a design can work, and maybe someone has already thought about something similar?
My expectation is that your approach would likely make the issues worse in a multi-CPU setting. If you put multiple reference counters into a contiguous block of memory, unrelated reference counters will live in the same cache line. Consequently, changing one reference counter on one CPU will invalidate the cached reference counters in that cache line on the other CPUs, actually making your problem 1) worse.
Regards, Martin
I don't think that moving ob_refcnt to a separate memory pool will solve the problem of cache pollution anyway. ob_refcnt is obviously the most stressed field in a PyObject, but it's not the only one. We also have ob_type, which is needed to model each object's (instance's) "behavior" and is massively accessed too, so a cache line will be loaded anyway when the object is used. Also, only a few simple objects have just ob_refcnt and ob_type; most of them have other fields too, and accessing those means another cache line load.

Regards, Cesare

P.S. Memory allocation granularity can help sometimes, leaving some data (ob_refcnt and/or ob_type) on one cache line, and the other fields on the next one.
On 23.05.2011 06:59, "Martin v. Löwis" wrote:
My expectation is that your approach would likely make the issues worse in a multi-CPU setting. If you put multiple reference counters into a contiguous block of memory, unrelated reference counters will live in the same cache line. Consequently, changing one reference counter on one CPU will invalidate the cached reference counters in that cache line on the other CPUs, actually making your problem 1) worse.
In a multi-threaded setting with concurrent threads accessing reference counts, this would certainly worsen the situation. In a single-threaded setting, it would likely be an improvement. CPython, however, has a GIL, so there is only one concurrently active thread with access to reference counts. On a thread switch in the interpreter, I think the performance result will depend on the nature of the Python code: if threads share a lot of objects, it could help to reduce the number of dirty cache lines; if threads mainly work on private objects, it would likely have the effect you predict. Which will dominate is hard to tell.

Instead, we could use multiple heaps: each Python thread could manage its own heap for malloc and free (cf. HeapAlloc and HeapFree on Windows). Objects local to one thread would reside only in the locally managed heap. When an object becomes shared by several Python threads, it would be moved from a local heap to the global heap of the process. Some objects, such as modules, would be stored directly on the global heap. This way, objects used by only one thread would never dirty cache lines used by other threads.

This would also be a way to reduce CPython's dependency on the GIL: only the global heap would need to be protected by the GIL, whereas the local heaps would not need any global synchronization.

(I am setting follow-up to the Python Ideas list; it does not belong on python-dev.)

Sturla Molden
Ok, I managed to make a quick but working patch (sufficient to get a working interpreter; it segfaults with extension modules). It uses the "ememoa" allocator (http://code.google.com/p/ememoa/), which seems a reasonable pool allocator. The patch: http://dpaste.org/K8en/. The main obstacle was that there isn't a single function/macro that can be used to initialize all PyObjects, so I had to initialize static PyObjects (mainly PyTypeObjects) by hand.

I used a naive quicksort algorithm as a benchmark: http://dpaste.org/qquh/ . The result is that after patching it runs 50% SLOWER. I profiled it, and the allocator methods accounted for 35% of the time, so there is still a 15% performance loss even granting that the allocator is poor.

Anyway, I'd like to have working copy-on-write in CPython - in the presence of the GIL I find it important to have multiprocess programs optimized (and I think it's a common idiom that a parent process prepares some big data structure, and child "worker" processes do some read-only querying).

Artur
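The dpaste links above are long dead; a naive quicksort benchmark of the kind described would plausibly look like the sketch below (a reconstruction, not the original code). It stresses exactly the machinery in question: every comparison, list build, and recursive call churns through allocations and INCREF/DECREF pairs.

```python
import random
import time

def quicksort(lst):
    # A deliberately naive version: it builds many short-lived lists,
    # which hammers both the allocator and the refcounting machinery.
    if len(lst) <= 1:
        return lst
    pivot = lst[0]
    less = [x for x in lst[1:] if x < pivot]
    more = [x for x in lst[1:] if x >= pivot]
    return quicksort(less) + [pivot] + quicksort(more)

data = [random.random() for _ in range(100000)]
start = time.time()
quicksort(data)
print("sorted in %.2f s" % (time.time() - start))
```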
On Mon, May 23, 2011 at 1:55 PM, Artur Siekielski <artur.siekielski@gmail.com> wrote:
Ok, I managed to make a quick but working patch (sufficient to get a working interpreter; it segfaults with extension modules). It uses the "ememoa" allocator (http://code.google.com/p/ememoa/), which seems a reasonable pool allocator. The patch: http://dpaste.org/K8en/. The main obstacle was that there isn't a single function/macro that can be used to initialize all PyObjects, so I had to initialize static PyObjects (mainly PyTypeObjects) by hand.
I used a naive quicksort algorithm as a benchmark: http://dpaste.org/qquh/ . The result is that after patching it runs 50% SLOWER. I profiled it, and the allocator methods accounted for 35% of the time, so there is still a 15% performance loss even granting that the allocator is poor.
Anyway, I'd like to have working copy-on-write in CPython - in the presence of the GIL I find it important to have multiprocess programs optimized (and I think it's a common idiom that a parent process prepares some big data structure, and child "worker" processes do some read-only querying).
That is the question though -- *is* the idiom commonly used? It doesn't seem to me that it would scale all that far, since it only works as long as all forked copies live on the same machine and run on the same symmetrical multi-core processor. -- --Guido van Rossum (python.org/~guido)
2011/5/23 Guido van Rossum <guido@python.org>:
Anyway, I'd like to have working copy-on-write in CPython - in the presence of the GIL I find it important to have multiprocess programs optimized (and I think it's a common idiom that a parent process prepares some big data structure, and child "worker" processes do some read-only querying).
That is the question though -- *is* the idiom commonly used?
In fact I came to this whole optimization idea because the idiom didn't work for me. I had a big word index built by a parent process, and then wanted the children to serve queries against this index (I wanted to use all the cores on a server). The index consumed 50% of RAM, and after a few minutes the children had consumed all of it. In languages like Java it's common to use thread pools; in Python+Linux we have process pools if we want to use all cores, and in this setting a working copy-on-write is really valuable (a sketch of the idiom follows at the end of this message). Oh, and using explicit shared memory or mmap is much harder, because you have to map the whole object graph into bytes.
It doesn't seem to me that it would scale all that far, since it only works as long as all forked copies live on the same machine and run on the same symmetrical multi-core processor.
? I don't know about multi-processor systems, but on single-processor multi-core systems (which are common even on servers) under Linux it works. Artur
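For concreteness, the parent-builds/children-query idiom being discussed looks roughly like this (an illustrative sketch with invented names, not code from the thread); on Linux the Pool workers are fork()ed and inherit `index` copy-on-write:

```python
import multiprocessing as mp

# The parent builds a big read-only structure once...
index = {i: str(i) * 10 for i in range(1000000)}

def lookup(key):
    # ...but every access INCREFs/DECREFs objects in `index`, dirtying the
    # inherited pages, so each worker gradually copies what it touches.
    return index.get(key)

if __name__ == "__main__":
    pool = mp.Pool(processes=4)      # fork()s the workers on Linux
    results = pool.map(lookup, range(100))
    pool.close()
    pool.join()
    print(len(results))
```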
On 24.05.2011 00:07, Artur Siekielski wrote:
Oh, and using explicit shared memory or mmap is much harder, because you have to map the whole object graph into bytes.
It sounds like you need PYRO, POSH or multiprocessing's proxy objects. Sturla
On Tue, May 24, 2011 at 8:33 AM, Sturla Molden <sturla@molden.no> wrote:
On 24.05.2011 00:07, Artur Siekielski wrote:
Oh, and using explicit shared memory or mmap is much harder, because you have to map the whole object graph into bytes.
It sounds like you need PYRO, POSH or multiprocessing's proxy objects.
Indeed. Abstractions over mmap (local machine sharing) and serialisation (remote sharing) are likely to be far more beneficial in this area than trying to change the underlying memory model to support copy-on-write. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
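As a sketch of the mmap-style sharing Nick alludes to (illustrative only, and Unix-specific because of fork()): a flat buffer of doubles shared between parent and child, with no pickling and no per-object refcounts on the payload.

```python
import mmap
import os
import struct

N = 1000
buf = mmap.mmap(-1, N * 8)      # anonymous shared mapping, survives fork()
for i in range(N):
    struct.pack_into("d", buf, i * 8, float(i))

if os.fork() == 0:              # child: reads the same physical pages
    total = sum(struct.unpack_from("d", buf, i * 8)[0] for i in range(N))
    print("child sum:", total)
    os._exit(0)
os.wait()
```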
2011/5/24 Sturla Molden <sturla@molden.no>:
Oh, and using explicit shared memory or mmap is much harder, because you have to map the whole object graph into bytes.
It sounds like you need PYRO, POSH or multiprocessing's proxy objects.
PYRO/multiprocessing proxies aren't a comparable solution, because of ORDERS OF MAGNITUDE worse performance. You are comparing direct memory access with serialization/message passing through sockets/pipes. POSH might be good, but the project has been dead for 8 years. And this copy-on-write approach is nice because you don't need changes or restrictions to your code, or a special garbage collector. Artur
On 24.05.2011 11:55, Artur Siekielski wrote:
PYRO/multiprocessing proxies aren't a comparable solution, because of ORDERS OF MAGNITUDE worse performance. You are comparing direct memory access with serialization/message passing through sockets/pipes.
The bottleneck is likely the serialization, but only if you serialize large objects. IPC itself is always very fast, at least on localhost. Just out of curiosity, have you considered using a database? Sqlite and BSD DB can even be put in shared memory if you want. It sounds like you are trying to solve a database problem with os.fork, something which is more or less doomed to fail (i.e. you would have to replicate all the effort that has gone into scaling up databases). If a database is too slow, I am rather sure you need something other than Python as well. Sturla
2011/5/24 Sturla Molden <sturla@molden.no>:
On 24.05.2011 11:55, Artur Siekielski wrote:
PYRO/multiprocessing proxies aren't a comparable solution, because of ORDERS OF MAGNITUDE worse performance. You are comparing direct memory access with serialization/message passing through sockets/pipes.
The bottleneck is likely the serialization, but only if you serialize large objects. IPC itself is always very fast, at least on localhost.
It cannot be "fast" compared to direct memory access. Here is a benchmark: summing numbers in a small list in a child process using a multiprocessing "manager": http://dpaste.org/QzKr/ , and using the implicit copy of the structure after fork(): http://dpaste.org/q3eh/. The first is 200 TIMES SLOWER. That means if the work finishes in 20 seconds using fork(), the same work takes more than an hour using a multiprocessing manager.
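Those dpaste links are gone too; a reconstruction of the comparison might look like the sketch below (Unix-only because of fork(); the 200x figure comes from the original measurement, not from this code). With a manager, every item fetched by sum() is a separate round-trip to the manager process.

```python
import os
import time
import multiprocessing as mp

DATA = list(range(10000))

def timed(label, func, seq):
    start = time.time()
    total = func(seq)
    print(label, total, "in %.3f s" % (time.time() - start))

def via_fork():
    pid = os.fork()
    if pid == 0:                    # child: DATA is inherited copy-on-write
        timed("fork:", sum, DATA)
        os._exit(0)
    os.waitpid(pid, 0)

def via_manager():
    manager = mp.Manager()
    shared = manager.list(DATA)     # each item fetch is an IPC round-trip
    p = mp.Process(target=timed, args=("manager:", sum, shared))
    p.start()
    p.join()

if __name__ == "__main__":
    via_fork()
    via_manager()
```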
If a database is too slow, I am rather sure you need something other than Python as well.
Disk access is about 1000x slower than memory access in C, and Python is at worst 50x slower than C, so there is still a huge win (not to mention that in the common case Python is only a few times slower). Artur
On 24.05.2011 17:39, Artur Siekielski wrote:
Disk access is about 1000x slower than memory access in C, and Python is at worst 50x slower than C, so there is still a huge win (not to mention that in the common case Python is only a few times slower).
You can put databases in shared memory (e.g. Sqlite and BSD DB have options for this). On Linux you can also use /dev/shm, which is a RAM-backed tmpfs. Also, why assume that the database developers at Oracle et al. haven't already done the necessary optimizations? Sturla
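A minimal sketch of that suggestion (assuming a standard Linux /dev/shm tmpfs mount; the table layout is invented for illustration): the database file lives entirely in RAM, and each worker process opens its own connection.

```python
import sqlite3

conn = sqlite3.connect("/dev/shm/index.db")   # file lives in RAM (tmpfs)
conn.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY, ids BLOB)")
conn.execute("INSERT OR REPLACE INTO words VALUES (?, ?)", ("python", b"\x01\x02"))
conn.commit()
row = conn.execute("SELECT ids FROM words WHERE word = ?", ("python",)).fetchone()
print(row)
```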
On 24.05.2011 11:55, Artur Siekielski wrote:
POSH might be good, but the project has been dead for 8 years. And this copy-on-write approach is nice because you don't need changes or restrictions to your code, or a special garbage collector.
Then I have a solution for you, one that is cheaper than anything else you are trying to do (taking work hours into account): BUY MORE RAM! RAM is damn cheap. You just need more of it. And 64-bit Python :-) Sturla
On Sun, May 22, 2011 at 1:57 AM, Artur Siekielski <artur.siekielski@gmail.com> wrote:
Hi. The problem with reference counters is that they are very often incremented/decremented, even by read-only algorithms (like traversal of a list). This has two drawbacks: 1. CPU cache lines (64 bytes on x86) containing the beginning of a PyObject are very often invalidated, resulting in losing many chances to use the CPU caches
Not sure exactly what scenario you are discussing here, but storing reference counts outside of objects has (at least on a single processor) worse cache locality than storing them inside objects.
However, the drawback is that such a design introduces a new level of indirection: a pointer inside each PyObject instead of a direct value. It also seems that the "block" with refcounts would have to be a non-trivial data structure.
That would almost certainly be slower for most use cases, except for the copy-on-write fork. I guess the Recycler papers might be an interesting read: http://www.research.ibm.com/people/d/dfb/recycler.html - this is the best reference-counting GC I'm aware of.
I'm not a compiler/profiling expert, so the main question is whether such a design can work, and maybe someone has already thought about something similar? And has CPython ever been profiled for CPU cache usage?
CPython was not designed for CPU cache usage as far as I'm aware.
From my (heavily biased) point of view, PyPy is a much better platform for such experiments (and PyPy has been profiled for CPU cache usage). The main advantage is that you can write your GC without having to modify the interpreter. On the other hand, you obviously don't get the benefits on CPython, but maybe it's worth experimenting.
Cheers, fijal
Maciej Fijalkowski, 24.05.2011 13:31:
CPython was not designed for CPU cache usage as far as I'm aware.
That's a pretty bold statement to make on this list. Even if it wasn't originally "designed" for (efficient?) CPU cache usage, it's certainly been around for long enough to have received numerous performance tweaks in that regard. I doubt that efficient CPU cache usage was a major design goal of PyPy right from the start. IMHO, the project has changed its objectives way too many times to claim something like that, especially at the low level where the CPU cache becomes relevant. I remember that not so long ago, PyPy was hugely memory hungry compared to CPython. Although, one could certainly call *that* "designed for CPU cache usage"... ;) Stefan
On Tue, 24 May 2011 14:05:26 +0200 Stefan Behnel <stefan_ml@behnel.de> wrote:
I doubt that efficient CPU cache usage was a major design goal of PyPy right from the start. IMHO, the project has changed its objectives way too many times to claim something like that, especially at the low level where the CPU cache becomes relevant. I remember that not so long ago, PyPy was hugely memory hungry compared to CPython. Although, one could certainly call *that* "designed for CPU cache usage"... ;)
Well, to be honest, "hugely memory hungry" doesn't necessarily mean cache-averse. It depends on the locality of memory access patterns. Regards Antoine.
Antoine Pitrou, 24.05.2011 14:32:
On Tue, 24 May 2011 14:05:26 +0200 Stefan Behnel wrote:
I doubt that efficient CPU cache usage was a major design goal of PyPy right from the start. IMHO, the project has changed its objectives way too many times to claim something like that, especially at the low level where the CPU cache becomes relevant. I remember that not so long ago, PyPy was hugely memory hungry compared to CPython. Although, one could certainly call *that* "designed for CPU cache usage"... ;)
Well, to be honest, "hugely memory hungry" doesn't necessarily mean cache-averse. It depends on the locality of memory access patterns.
Sure. AFAIR (and Maciej is certainly the right person to prove me wrong), the problem at the time was that the overall memory footprint of objects was too high. That, at least, speaks against efficient cache usage and makes cache thrashing more likely. In any case, we're talking about a historical problem that they have already fixed. Stefan
On Tue, May 24, 2011 at 10:05 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Maciej Fijalkowski, 24.05.2011 13:31:
CPython was not designed for CPU cache usage as far as I'm aware.
That's a pretty bold statement to make on this list. Even if it wasn't originally "designed" for (efficient?) CPU cache usage, it's certainly been around for long enough to have received numerous performance tweaks in that regard.
As a statement of Guido's original intent, I'd side with Maciej (Guido has made it pretty clear that he subscribes to the "first, make it work, and only worry about making it faster if that first approach isn't good enough" school of thought). Various *parts* of CPython, on the other hand, have indeed been optimised over the years to be quite aware of potential low level CPU and RAM effects (e.g. dicts, sorting, the small object allocator). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
2011/5/24 Stefan Behnel <stefan_ml@behnel.de>
Maciej Fijalkowski, 24.05.2011 13:31:
CPython was not designed for CPU cache usage as far as I'm aware.
That's a pretty bold statement to make on this list. Even if it wasn't originally "designed" for (efficient?) CPU cache usage, it's certainly been around for long enough to have received numerous performance tweaks in that regard.
Stefan
Maybe a change in memory allocation granularity can help here. Raising it to 16 and 32 bytes on 32- and 64-bit systems respectively guarantees that an access to ob_refcnt and/or ob_type pulls into the cache line some other information about the same object, which is usually needed anyway (except for very simple objects, such as PyNone, PyEllipsis, etc.). Think about a long, a tuple, a list, a dictionary, etc.: all of them have critical data right after these fields, which will most likely be accessed after a Py_INCREF or a type check. Regards, Cesare
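To see the header layout Cesare is talking about, here is a CPython-specific peek (a hedged sketch: it assumes a standard non-debug CPython build, where ob_refcnt is the first machine word of every object and ob_type the second):

```python
import ctypes
import sys

x = 3.14                    # an ordinary heap-allocated float
addr = id(x)                # in CPython, id() is the object's address
refcnt = ctypes.c_ssize_t.from_address(addr).value
# sys.getrefcount reports one extra reference (its own argument)
print(refcnt, sys.getrefcount(x))
```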
On 24.05.2011 13:31, Maciej Fijalkowski wrote:
Not sure exactly what scenario you are discussing here, but storing reference counts outside of objects has (at least on a single processor) worse cache locality than storing them inside objects.
Artur Siekielski is not talking about cache locality, but about copy-on-write fork() on Linux et al. When reference counts are updated after forking, memory pages marked copy-on-write are copied if they hold reference counts. And then he quickly runs out of memory. He wants to put reference counts and PyObjects in different pages, so that only the pages with reference counts get copied. I don't think he cares about cache locality at all, but the rest of us do :-) Sturla
On 5/24/2011 8:25 AM, Sturla Molden wrote:
Artur Siekielski is not talking about cache locality, but about copy-on-write fork() on Linux et al.
When reference counts are updated after forking, memory pages marked copy-on-write are copied if they hold reference counts. And then he quickly runs out of memory. He wants to put reference counts and PyObjects in different pages, so that only the pages with reference counts get copied.
I don't think he cares about cache locality at all, but the rest of us do :-)
It seems clear that separating reference counts from objects satisfies a specialized need, and should be done in a special, patched version of CPython rather than in the general distribution. -- Terry Jan Reedy
On Tue, May 24, 2011 at 8:44 AM, Terry Reedy <tjreedy@udel.edu> wrote:
On 5/24/2011 8:25 AM, Sturla Molden wrote:
Artur Siekielski is not talking about cache locality, but about copy-on-write fork() on Linux et al.
When reference counts are updated after forking, memory pages marked copy-on-write are copied if they hold reference counts. And then he quickly runs out of memory. He wants to put reference counts and PyObjects in different pages, so that only the pages with reference counts get copied.
I don't think he cares about cache locality at all, but the rest of us do :-)
It seems clear that separating reference counts from objects satisfies a specialized need and should be done in a spedial, patched version of CPython rather than the general distribution.
I'm not sure I agree, especially given that the classical answer to GIL woes has been to tell people to fork() themselves. There has to be a lot of code out there that would benefit from this. Geremy Condra
participants (12)
- "Martin v. Löwis"
- Antoine Pitrou
- Artur Siekielski
- Cesare Di Mauro
- Charles-François Natali
- geremy condra
- Guido van Rossum
- Maciej Fijalkowski
- Nick Coghlan
- Stefan Behnel
- Sturla Molden
- Terry Reedy