[Python-Dev] Optimization targets - refcount
mikepy-0404 at mike.de
Fri Apr 16 14:26:05 EDT 2004
> I think the idea is more that you could skip the 'Py_INCREF(Py_None)',
> which is a fairly common prelude to 'return Py_None'. You'd set their
> refcounts to the maximum possible value, and the deallocation function
> would simply reset the refcount to maximum again.
> I'm not sure, however, that this would be common enough to be helpful. It
> seems to me Py_INCREF should effectively translate to only a single machine
> instruction or two. I mean, it's just incrementing an integer whose
> address is known at compile time.
I don't think it would matter much performance-wise. A quick scan does
not reveal a single Py_INCREF(Py_None) in any hot code section. But with
more than 800 occurrences, dropping them would reduce the code size a bit
(5-10K), so that alone is not a good reason to throw the idea out.
Checking for Py_None in Py_DECREF would be counter-productive, because it
would add another branch. The existing branch is always taken for Py_None,
since its refcount never goes to zero, so this is not an issue.
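To make the idea from the quoted text concrete, here is a minimal sketch of an "immortal" singleton: give it an effectively infinite refcount and have its deallocator simply restore the count instead of freeing anything. The names (FakeObject, IMMORTAL_REFCNT, none_dealloc) are illustrative, not CPython's actual implementation.

```c
#include <assert.h>
#include <limits.h>

typedef struct {
    long ob_refcnt;
} FakeObject;

/* Large enough that normal INCREF/DECREF traffic never drains it. */
#define IMMORTAL_REFCNT (LONG_MAX / 2)

static void none_dealloc(FakeObject *op)
{
    /* Never free the singleton: just top the refcount back up. */
    op->ob_refcnt = IMMORTAL_REFCNT;
}

static void fake_decref(FakeObject *op)
{
    if (--op->ob_refcnt == 0)
        none_dealloc(op);
}
```

With this scheme, callers could return the singleton without a preceding Py_INCREF; even if the count ever hit zero, the deallocator would quietly repair it.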
But let's look at the general case. Here's a common fragment of inlined
x86 code for Py_DECREF:
mov -something(%ebp),%esi ; odep %esi
mov (%esi),%eax ; idep %esi, odep %eax
dec %eax ; idep %eax, odep %eax
test %eax,%eax ; [idep %eax], odep flags
mov %eax,(%esi) ; [idep %eax %esi]
jne .L1 ; idep flags, predict
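For reference, that inlined sequence corresponds roughly to the following C: a decrement, a test, a store, and a conditional call into the deallocator. This is a simplified stand-in (my_decref, a function-pointer dealloc slot), not CPython's actual Py_DECREF macro.

```c
#include <assert.h>

typedef struct obj {
    long ob_refcnt;
    void (*dealloc)(struct obj *);
} Obj;

/* Counts deallocations so the fast/slow paths can be observed. */
static int dealloc_calls = 0;
static void count_dealloc(Obj *op)
{
    (void)op;
    dealloc_calls++;
}

static void my_decref(Obj *op)
{
    op->ob_refcnt--;            /* mov/dec/mov: load, decrement, store */
    if (op->ob_refcnt != 0)     /* test + jne .L1: the predicted branch */
        return;
    op->dealloc(op);            /* slow path: actual deallocation */
}
```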
There are two to three data dependencies here that are hard to untangle.
The other problem is branch prediction. I've done a little analysis with
gcov on ceval.c and found that the situation is not that bad. The Py_DECREF
occurrences that account for 90% of the time are almost exclusively 0/100
or 100/0 cases. The remaining 10% have a few 30/70 or 50/50 cases.
The folded and weighted average is around 4/96. The branch misprediction
penalty accounts for less than 0.1% of the runtime.
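The "folded and weighted" averaging above can be sketched as follows: fold each branch site's taken ratio toward [0, 50]% (a 100/0 site mispredicts about as rarely as a 0/100 one), then weight by how often each site executes. The counts in the test are made up for illustration; they are not the actual gcov data.

```c
#include <assert.h>
#include <math.h>

struct site {
    long executions;     /* how often this Py_DECREF site runs */
    double taken_ratio;  /* fraction of executions where the branch is taken */
};

static double folded_weighted_avg(const struct site *s, int n)
{
    double num = 0.0, den = 0.0;
    for (int i = 0; i < n; i++) {
        double r = s[i].taken_ratio;
        /* Fold: distance from "always" or "never" taken. */
        double folded = (r < 0.5) ? r : 1.0 - r;
        num += folded * (double)s[i].executions;
        den += (double)s[i].executions;
    }
    return num / den;
}
```

For example, 900 executions of a 0/100 site plus 50 each of a 30/70 and a 50/50 site fold to an overall rate of about 4%, matching the ballpark quoted above.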
But the deallocation itself is costly. On average the deallocator is
called 15% of the time (I checked ceval.c only). This is a lot and
accounts for up to 5% of the runtime (this is based on an overall profile).
My guess is that allocation is a bit cheaper, but adds another 2-3%.
I think tuples are the worst offender (again, only a good guess from my
profiling).
To summarize: I think we can optimize the GC a little more (have not done
an in-depth analysis yet), but we really should focus on making Python
less malloc-happy. Suggestions welcome!
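One standard way to make an allocator less malloc-happy for small, frequently recycled objects like tuples is a per-size free list: freed blocks are pushed onto a list and handed back on the next allocation without touching malloc. This is a generic sketch of the technique, not CPython's tupleobject.c.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct node {
    struct node *next;
} node;

#define MAXSLOT 8
/* One free list per size class ("slot"). */
static node *free_list[MAXSLOT];

static void *fast_alloc(size_t slot, size_t size)
{
    if (slot < MAXSLOT && free_list[slot] != NULL) {
        node *n = free_list[slot];
        free_list[slot] = n->next;
        return n;                   /* recycled: no malloc call */
    }
    return malloc(size > sizeof(node) ? size : sizeof(node));
}

static void fast_free(void *p, size_t slot)
{
    if (slot < MAXSLOT) {           /* push back for later reuse */
        node *n = p;
        n->next = free_list[slot];
        free_list[slot] = n;
    } else {
        free(p);
    }
}
```

The payoff is that a tight loop allocating and freeing same-sized objects hits malloc only once per size class; the tradeoff is that the cached blocks are never returned to the system.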