[Python-Dev] Optimization targets - refcount

Fri Apr 16 14:26:05 EDT 2004

Hi,

pje wrote:
> I think the idea is more that you could skip the 'Py_INCREF(Py_None)', 
> which is a fairly common prelude to 'return Py_None'.  You'd set their 
> refcounts to the maximum possible value, and the deallocation function 
> would simply reset the refcount to maximum again.
> 
> I'm not sure, however, that this would be common enough to be helpful.  It 
> seems to me Py_INCREF should effectively translate to only a single machine 
> instruction or two.  I mean, it's just incrementing an integer whose 
> address is known at compile time.

I don't think it would matter much performance-wise. A quick scan does
not reveal a single Py_INCREF(Py_None) in any hot code section. But with
more than 800 occurences it would reduce the code size a bit (5-10K).
So that alone is not a good reason to throw it out.

Checking for Py_None in Py_DECREF is counter-productive because that would
need another branch. The existing branch is always taken with Py_None since
the refcount never goes zero. I.e. this is not an issue.

But let's look at the general case. Here's a common fragment of inlined
x86 code for Py_DECREF:

mov    -something(%ebp),%esi	; odep %esi
mov    (%esi),%eax		; idep %esi, odep %eax
dec    %eax 			; idep %eax, odep %eax
test   %eax,%eax		; [idep %eax], odep flags
mov    %eax,(%esi)		; [idep %eax %esi]
jne    .L1 			; idep flags, predict

sub    $0xc,%esp
mov    0x4(%esi),%eax
push   %esi
call   *0x18(%eax)
add    $0x10,%esp
.L1:

There are at least 2 to 3 data dependencies that are hard to untangle.
Tough luck.

The other problem is branch prediction. I've done a little analysis with
gcov on ceval.c and found that the situation is not that bad. The Py_DECREF
occurences that account for 90% of the time are almost exclusively 0/100
or 100/0 cases. The remaining 10% have a few 30/70 or 50/50 cases.
The folded and weighted average is around 4/96. The branch misprediction
penalty accounts for less than 0.1% of the runtime.

But the deallocation itself is costly. On average the deallocator is
called 15% of the time (I checked ceval.c only). This is a lot and
accounts for up to 5% of the runtime (this is based on an overall profile).
My guess is that allocation is a bit cheaper, but adds another 2-3%.
I think tuples are the worst offender (again only a good guess from my
profiling experiments).

To summarize: I think we can optimize the GC a little more (have not done
an in-depth analysis yet), but we really should focus on making Python
less malloc-happy. Suggestions welcome!

Bye,
     Mike