On 2018-09-13, Neil Schemenauer wrote:
Making Py_TYPE(), Py_INCREF(), Py_DECREF() into inline functions and adding a conditional branch to check for a tag costs roughly 8%.
I've been pondering this result. It seems surprising that such a small number of new instructions (admittedly on a super hot path) would cause such a slowdown. The disassembled code for a call to Py_INCREF in listmodule.c is below:
     1ff:   48 8b 1e                mov    (%rsi),%rbx
        if (_Py_IsTaggedPtr(op)) {
     202:   f6 c3 01                test   $0x1,%bl
     205:   0f 85 00 00 00 00       jne    20b <PyList_AsTuple+0xdb>
            ((PyObject *)(op))->ob_refcnt++);
     20b:   48 83 03 01             addq   $0x1,(%rbx)
The extra instructions that the tagging adds are the "test" and the "jne". I compiled with PGO, so branches should be set up to make the best use of likely/unlikely branch paths.
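For reference, here is a minimal C sketch of the kind of inline function that would compile to a test-and-branch like the one above. It is not the actual patch: the `Sketch*` and `sketch_*` names are hypothetical, and it assumes that the low bit of the pointer marks a tagged value (consistent with the `test $0x1`) and that a tagged value simply skips the refcount update.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PyObject; only the refcount matters here. */
typedef struct {
    ptrdiff_t ob_refcnt;
} SketchObject;

/* Assumption: the low bit marks a tagged (non-heap) value. This is the
   check that shows up as "test $0x1,%bl" in the disassembly. */
static inline int sketch_is_tagged(const void *op)
{
    return ((uintptr_t)op & 1) != 0;
}

/* Assumed encoding for a small integer: shift left, set the low bit. */
static inline SketchObject *sketch_tag_int(intptr_t value)
{
    return (SketchObject *)(((uintptr_t)value << 1) | 1);
}

static inline void sketch_incref(SketchObject *op)
{
    if (sketch_is_tagged(op)) {
        return;             /* tagged value: nothing to count */
    }
    op->ob_refcnt++;        /* the "addq $0x1,(%rbx)" */
}

static inline void sketch_decref(SketchObject *op)
{
    if (sketch_is_tagged(op)) {
        return;
    }
    assert(op->ob_refcnt > 0);
    op->ob_refcnt--;        /* deallocation at zero is omitted here */
}
```

The only difference from a plain refcount increment is the extra compare and conditional branch in front of the memory update, which is what the 8% figure is measuring.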
The cycles for the memory write are more difficult to estimate. If we assume the refcnt is in L2 cache on a Haswell processor, the latency is 12 cycles. For L1, 4 cycles.
So, the fact that these two extra instructions add about 8% overhead is an interesting result. I think it means that Py_INCREF and Py_DECREF account for a huge share of the CPU cycles used by real programs. Making a non-refcount GC has all kinds of challenges. But there must be a lot of overhead that can be removed by removing INCREF/DECREF.
I really want a revised C-API that allows a non-refcount core GC with refcounting for "handles" passed to extensions. It opens the door to someone in the future making a better GC. With the API we have today, it can't happen without either breaking most C extensions or at least taking a huge performance hit for them.
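To make the "handles" idea a bit more concrete, here is a rough sketch under heavy assumptions. None of these names exist in CPython; `SketchHandle` and the `sketch_handle_*` functions are purely illustrative. The point is that only the handle slot carries a refcount, and the object behind it does not, so a tracing collector could treat live slots as roots.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_HANDLES 16

/* One slot per handle given out to extension code (hypothetical). */
typedef struct {
    void  *object;  /* core-managed object; carries no refcount itself */
    size_t refcnt;  /* references held by extensions; 0 means free slot */
} SketchHandleSlot;

typedef int SketchHandle;   /* index into the table; -1 signals failure */

static SketchHandleSlot sketch_table[SKETCH_MAX_HANDLES];

/* Hand an object to an extension: claim a free slot with refcnt 1. */
static SketchHandle sketch_handle_new(void *object)
{
    for (int i = 0; i < SKETCH_MAX_HANDLES; i++) {
        if (sketch_table[i].refcnt == 0) {
            sketch_table[i].object = object;
            sketch_table[i].refcnt = 1;
            return i;
        }
    }
    return -1;
}

static void sketch_handle_incref(SketchHandle h)
{
    assert(h >= 0 && h < SKETCH_MAX_HANDLES && sketch_table[h].refcnt > 0);
    sketch_table[h].refcnt++;
}

static void sketch_handle_decref(SketchHandle h)
{
    assert(h >= 0 && h < SKETCH_MAX_HANDLES && sketch_table[h].refcnt > 0);
    if (--sketch_table[h].refcnt == 0) {
        sketch_table[h].object = NULL;  /* slot no longer roots the object */
    }
}

static void *sketch_handle_deref(SketchHandle h)
{
    assert(h >= 0 && h < SKETCH_MAX_HANDLES && sketch_table[h].refcnt > 0);
    return sketch_table[h].object;
}

/* Live slots are what a tracing collector would scan as roots. */
static size_t sketch_live_roots(void)
{
    size_t n = 0;
    for (int i = 0; i < SKETCH_MAX_HANDLES; i++) {
        if (sketch_table[i].refcnt > 0) {
            n++;
        }
    }
    return n;
}
```

With something along these lines, refcount traffic would be confined to the extension boundary, and the core would be free to move away from per-object counts.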
Regards,
Neil