A couple garbage collector questions
Alex Martelli
aleaxit at yahoo.com
Tue Apr 3 04:21:58 EDT 2001
"Douglas Alan" <nessus at mit.edu> wrote in message
news:lclmpi4n7g.fsf at gaffa.mit.edu...
[snip]
> > > backing it up with a simple mark & sweep GC seems adequate for
> > > most purposes.
>
> > Reference-counting exacts very heavy performance costs, no matter
> > what you back it up with.
>
> What's so heavy about its performance cost? Will a mark & sweep GC
> perform better on average? I'm skeptical. A compacting, copying
> garbage collector probably will, but it can double the space cost of
> your program.
You can track just about all of the classic references by starting at
http://citeseer.nj.nec.com/wilson92uniprocessor.html
The field is (deservedly) heavily researched, so you'll find loads
and loads of data pro and con any position. But I think it's a fair
summary to assert that reference counting's doubling or tripling of
the (small) overhead of any binding/rebinding, in most situations
relevant to Python, will swamp the costs connected with other GC
strategies.
"Impressionistically"...:
-- in an RC-based system, any time you're re-binding a reference
from object-a to object-b (e.g., compiling a Python statement
a = b
) this translates to the following low-level operations:
-- if a was already bound (say to object-a)
-- decrement object-a's RC field
-- test that RC for equality to 0
depending on (important!-) implementation details,
you may need an extra test to ensure against disaster
for a=a (e.g., increment RHS's RC _before_ decrementing
LHS's, if 'a=a' where a is the only reference to
object-a is a possibility)
-- given that b is bound (say to object-b)
-- increment object-b's RC field
-- copy the actual reference (e.g., address)
-- in a system not using RC, this compiles down to just:
-- copy the actual reference (e.g., address)
The object themselves will be totally unaffected by the rebinding;
this (again depending on important implementation details) may yield
_very_ substantial savings (by not bringing a whole object, or some
substantial subset thereof, into the machine's cache -- even more
important, of course, if the objects can possibly be on secondary
storage and require paging-in!).
A re-binding in a non-RC system is basically one "elementary" (memory
to memory) instruction -- say two in a typical RISC machine where
you need to go through a register for the m-to-m copy. If RC's have
to be maintained, the cost zooms up -- to at least _a few_ such
elementary instructions, even ignoring cache and paging effect. These
extra costs, albeit small ones, may have to be paid VERY frequently,
depending on the typical programming style in the language under
consideration ("single-assignment" languages, vs ones such as Python
where rebinding in loops is the norm). On the other hand, if every
elementary operation is incurring heavy interpreting overhead in any
case (in fetching and dispatching bytecodes, or, even worse, in name
lookup), then it's conceivable that a few extra machine instructions
(per re-binding elementary operation) are no huge load.
But saddling a language (or object-model) definition with the semantics
of reference counting may heavily inhibit to-the-metal optimizations
in the future, as _every_ correct implementation will need to pay
these costs-per-rebinding forevermore. I think it's instructive in
this regard that, designing a new object model after a decade of
experience with COM, Microsoft researchers chose to abandon reference
count semantics in favour of more generic garbage collection in .NET --
the performance costs of COM's guaranteed RC-semantics having proved
substantial in the light of long and widespread experience. (Experienced
COM programmers have wailed long and hard about this choice, since RC
semantics _are_ convenient indeed in many cases for the programmer, of
course -- although general-GC has its own convenience in terms of no
worries about circular-reference leaks, but that's another issue).
Alex
More information about the Python-list
mailing list