Mike Pall:
About GC: yes, refcounting is the silent killer. ... Py_DECREF is awful ... > 3500 locations
Is it always needed?
What if a few common (constant, singleton) objects (such as None, -1, 0, 1) were declared immortal at compile-time? They would be created at initial load in a special untracked pool, and their tp_dealloc would do nothing.
The slot for tracking references would still be there, but could be ignored -- even if it went negative.
Since the reference count no longer has to be correct (for these objects), the reference counting macros could optimize to nothing when they know at compile time that they'll have one of these constant objects.
-jJ
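A minimal sketch of the proposal, with invented names (none of this is actual CPython code): the immortal singleton's tp_dealloc simply does nothing, so a wrong, or even negative, refcount is harmless.

    #include "Python.h"

    /* Hypothetical tp_dealloc for an immortal singleton: it lives in a
       static, untracked pool and is never freed, so a stale or negative
       ob_refcnt can do no damage. */
    static void
    immortal_dealloc(PyObject *op)
    {
        (void)op;  /* intentionally empty: immortals are never deallocated */
    }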
Jewett, Jim J wrote:
What if a few common (constant, singleton) objects (such as None, -1, 0, 1) were declared immortal at compile-time? ... the reference counting macros could optimize to nothing when they know at compile time that they'll have one of these constant objects.
So instead of writing Py_DECREF(foo), would we write code like

    if (foo != Py_None && foo != Py_Neg1 && foo != Py_Zero && foo != Py_One) {
        Py_DECREF(foo);
    }
Paul Prescod
At 11:11 PM 4/15/04 -0700, Paul Prescod wrote:
So instead of writing Py_DECREF(foo), would we write code like

    if (foo != Py_None && foo != Py_Neg1 && foo != Py_Zero && foo != Py_One) {
        Py_DECREF(foo);
    }
I think the idea is more that you could skip the 'Py_INCREF(Py_None)', which is a fairly common prelude to 'return Py_None'. You'd set their refcounts to the maximum possible value, and the deallocation function would simply reset the refcount to maximum again.
I'm not sure, however, that this would be common enough to be helpful. It seems to me Py_INCREF should effectively translate to only a single machine instruction or two. I mean, it's just incrementing an integer whose address is known at compile time.
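A minimal sketch of that variant, assuming the 2004-era int-typed ob_refcnt (INT_MAX stands in for "the maximum possible value"; the function name is invented):

    #include <limits.h>
    #include "Python.h"

    /* Hypothetical dealloc for an immortal singleton: refill the refcount
       instead of freeing, so no sequence of Py_DECREFs can ever kill it. */
    static void
    immortal_refill_dealloc(PyObject *op)
    {
        op->ob_refcnt = INT_MAX;
    }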
pje wrote:
I think the idea is more that you could skip the 'Py_INCREF(Py_None)' ... You'd set their refcounts to the maximum possible value, and the deallocation function would simply reset the refcount to maximum again.
I vaguely recall that this was proposed and perhaps even tried before, and not found to make a difference.
--Guido van Rossum (home page: http://www.python.org/~guido/)
Hi,
pje wrote:
I think the idea is more that you could skip the 'Py_INCREF(Py_None)', which is a fairly common prelude to 'return Py_None'. ...
It seems to me Py_INCREF should effectively translate to only a single machine instruction or two.
I don't think it would matter much performance-wise. A quick scan does not reveal a single Py_INCREF(Py_None) in any hot code section. But with more than 800 occurrences it would reduce the code size a bit (5-10K). So that alone is not a good reason to throw it out.
Checking for Py_None in Py_DECREF is counter-productive, because that would need another branch. The existing branch is always taken with Py_None, since the refcount never goes to zero. I.e. this is not an issue.
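For reference, in a release build Py_DECREF expands to roughly the following (a simplified rendering; debug-build bookkeeping omitted):

    #define Py_DECREF(op)                       \
        if (--(op)->ob_refcnt != 0)             \
            ;                                   \
        else                                    \
            _Py_Dealloc((PyObject *)(op))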
But let's look at the general case. Here's a common fragment of inlined x86 code for Py_DECREF:
    mov  -something(%ebp),%esi   ; odep %esi
    mov  (%esi),%eax             ; idep %esi, odep %eax
    dec  %eax                    ; idep %eax, odep %eax
    test %eax,%eax               ; [idep %eax], odep flags
    mov  %eax,(%esi)             ; [idep %eax %esi]
    jne  .L1                     ; idep flags, predict

    sub  $0xc,%esp
    mov  0x4(%esi),%eax
    push %esi
    call *0x18(%eax)
    add  $0x10,%esp
.L1:
There are at least 2 to 3 data dependencies that are hard to untangle. Tough luck.
The other problem is branch prediction. I've done a little analysis with gcov on ceval.c and found that the situation is not that bad. The Py_DECREF occurrences that account for 90% of the time are almost exclusively 0/100 or 100/0 cases. The remaining 10% have a few 30/70 or 50/50 cases. The folded and weighted average is around 4/96. The branch misprediction penalty accounts for less than 0.1% of the runtime.
But the deallocation itself is costly. On average the deallocator is called 15% of the time (I checked ceval.c only). This is a lot and accounts for up to 5% of the runtime (this is based on an overall profile). My guess is that allocation is a bit cheaper, but adds another 2-3%. I think tuples are the worst offender (again only a good guess from my profiling experiments).
To summarize: I think we can optimize the GC a little more (have not done an in-depth analysis yet), but we really should focus on making Python less malloc-happy. Suggestions welcome!
Bye, Mike
On Fri, Apr 16, 2004, Mike Pall wrote:
To summarize: I think we can optimize the GC a little more (have not done an in-depth analysis yet), but we really should focus on making Python less malloc-happy. Suggestions welcome!
Could you expand on this? What do you mean by "less malloc-happy"? What you were talking about earlier was ``None``, which doesn't go through malloc hoops.
I think you might find it useful to at least read old threads about refcounting vs. GC. (I managed to dig up the old thread about getting rid of refcounting for ``None``; don't have time for more.)
Hi,
Aahz wrote:
Could you expand on this? What do you mean by "less malloc-happy"? What you were talking about earlier was ``None``, which doesn't go through malloc hoops.
I think you might find it useful to at least read old threads about refcounting vs. GC. (I managed to dig up the old thread about getting rid of refcounting for ``None``; don't have time for more.)
I'm very well aware of the tradeoffs between refcounting and pure GC schemes. I do *not* suggest that we change CPython's approach. And I do not intend to start another refcount-vs-GC flamewar. You've got me wrong.
Also please check the thread -- I never suggested modifying anything about Py_None. I just presented a short quantitative analysis that it's useless to do so (the thread you referenced contains another good reason).
What I meant by 'less malloc-happy' is that we should try to reduce the number of allocations/deallocations for internal (!) objects (I'm not talking about objects explicitly managed by the application). Many of them have a very short lifetime. I guess some of them could be recycled or even optimized away.
E.g. global profiling indicates that tuple allocation/deallocation has a noticeable effect on performance. There are probably other objects with prohibitive setup/teardown costs (though these are less pronounced on the allocation side).
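One form such recycling can take, sketched from the kind of per-size caching CPython's tupleobject.c already does (the names and limits here are made up, and GC tracking details are omitted; this is not the real code): keep small dead tuples on a free list and hand them out again instead of going back to the allocator.

    #include "Python.h"

    #define MAXSAVESIZE 20   /* hypothetical: cache tuples up to this size */

    static PyTupleObject *free_list[MAXSAVESIZE];

    static PyTupleObject *
    tuple_alloc(int size)
    {
        if (size > 0 && size < MAXSAVESIZE && free_list[size] != NULL) {
            /* Pop a recycled tuple; ob_item[0] chains the free list
               while the tuple is dead. */
            PyTupleObject *op = free_list[size];
            free_list[size] = (PyTupleObject *)op->ob_item[0];
            _Py_NewReference((PyObject *)op);
            return op;
        }
        return PyObject_GC_NewVar(PyTupleObject, &PyTuple_Type, size);
    }

    static void
    tuple_free(PyTupleObject *op, int size)
    {
        if (size > 0 && size < MAXSAVESIZE) {
            /* Push onto the free list instead of really freeing. */
            op->ob_item[0] = (PyObject *)free_list[size];
            free_list[size] = op;
        }
        else
            PyObject_GC_Del(op);
    }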
But since this is only #4 on my list I have not given it a closer look. That's why I said 'suggestions welcome'.
Bye, Mike
Mike Pall wrote:
What I meant by 'less malloc-happy' is that we should try to reduce the number of allocations/deallocations for internal (!) objects ... E.g. global profiling indicates that tuple allocation/deallocation has a noticeable effect on performance.
The problem is that this has already been done in the past. Anybody contributing to the Python core is certainly aware that memory allocations should be avoided where possible, and indeed, special cases have been added (e.g. my addition of METH_NOARGS and METH_O to avoid tuple creation).
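For context, METH_NOARGS and METH_O let the interpreter call a C function without building an argument tuple at all. A typical (made-up) method table looks like this:

    #include "Python.h"

    /* METH_NOARGS: called with args == NULL; no argument tuple is built. */
    static PyObject *
    example_ping(PyObject *self, PyObject *args)
    {
        Py_INCREF(Py_None);
        return Py_None;
    }

    /* METH_O: the single argument is passed directly; again, no tuple. */
    static PyObject *
    example_echo(PyObject *self, PyObject *arg)
    {
        Py_INCREF(arg);
        return arg;
    }

    static PyMethodDef example_methods[] = {
        {"ping", example_ping, METH_NOARGS, "Returns None; no tuple built."},
        {"echo", example_echo, METH_O, "Returns its argument; no tuple built."},
        {NULL, NULL, 0, NULL}
    };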
So unless a specific proposal is made, I'm doubtful that much optimization is possible without breaking core semantic aspects of the language.
Regards, Martin
Maybe something like this:
    #ifdef __GNUC__
    #define constant_p(x) __builtin_constant_p(x)
    #else
    #define constant_p(x) 0
    #endif

    #define Py_INCREF(x) \
        if (!constant_p(x) || ((x) != Py_None /* && (x) != ...other immortals... */)) \
            /* ...actual Py_INCREF body for (x)... */
This should allow the compiler, by constant propagation, to skip the actual Py_INCREF body when it sees Py_INCREF(Py_None), and to emit no equality tests against the immortal values when it sees Py_INCREF(nonconstant).
does this actually work? I didn't check. I think that on some important systems there may be no equivalent to gcc's __builtin_constant_p(), and that Py_None is not a compile-time constant (*cough* DLLs *cough*)
Jeff
"Phillip J. Eby" pje@telecommunity.com:
I'm not sure, however, that this would be common enough to be helpful. It seems to me Py_INCREF should effectively translate to only a single machine instruction or two.
On a RISC architecture, probably 3 instructions -- load, increment, store.
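Roughly, assuming the object pointer is already in a register and ob_refcnt is the first word of the object (illustrative MIPS-style code, not from any actual build):

    lw    t1, 0(t0)      # load ob_refcnt (memory access)
    addiu t1, t1, 1      # increment
    sw    t1, 0(t0)      # store it back (memory access)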
It's the memory accesses that will be the most expensive part of this. But then, if increfing Py_None is really that common, its refcount will likely be in cache.
I doubt whether it really is very common, though, compared to increfs of all other objects.
Greg Ewing, Computer Science Dept, University of Canterbury,
Christchurch, New Zealand
greg@cosc.canterbury.ac.nz
"A citizen of NewZealandCorp, a wholly-owned subsidiary of USA Inc."
On Thu, Apr 15, 2004, Jewett, Jim J wrote:
What if a few common (constant, singleton) objects (such as None, -1, 0, 1) were declared immortal at compile-time? ... the reference counting macros could optimize to nothing when they know at compile time that they'll have one of these constant objects.
http://mail.python.org/pipermail/python-list/2003-September/thread.html#1837...