[Python-Dev] Garbage collector problem

Kevin Jacobs jacobs@penguin.theopalgroup.com
Fri, 28 Jun 2002 10:44:44 -0400 (EDT)


I've found what I consider a major problem with the garbage collector in the
Python 2.3 CVS tree.  Here is a small kernel that demonstrates the problem:

lst = []
for i in range(100000):
  lst.append( (1,) )

The key ingredients are:

  1) A method is called on a container (rather than __setitem__ or
     __setattr__).

  2) A new object is allocated while the method object lives on the Python
     VM stack, as shown by the disassembled bytecodes:

         40 LOAD_FAST                1 (lst)
         43 LOAD_ATTR                3 (append)
         46 LOAD_CONST               2 (1)
         49 BUILD_TUPLE              1
         52 CALL_FUNCTION            1

These ingredients combine in the following way to trigger quadratic-time
behavior in the Python garbage collector:

  * First, the LOAD_ATTR on "lst" for "append" is called, and a PyCFunction
    is returned from this code in descrobject.c:method_get:

        return PyCFunction_New(descr->d_method, obj);

    Thus, a _new_ PyCFunction is allocated every time the method is
    requested.

  * This new method object is added to generation 0 of the garbage
    collector, which holds a reference to "lst".

  * The BUILD_TUPLE call may then trigger a garbage collection cycle.

  * Since the "append" method is in generation 0, the reference traversal
    must also follow all objects within "lst", even if "lst" is in
    generation 1 or 2.  This traversal requires time linear in the number of
    objects in "lst", thus increasing the overall time complexity of the
    code to quadratic in the number of elements in "lst".

Also note that this is a much more general problem than this small example. 
It can affect many types of objects in addition to methods, including
descriptors, iterator objects, and any other object that contains a "back
reference".

So, what can be done about this.... One simple solution would be to not
traverse some "back references" if we are collecting objects in generation
0.  This will avoid traversing virtually all of these ephemoral objects that
will trigger such expensive behavior.  If they live long enough to pass
through to generation one or two, then clearly they should be traversed.

So, what do all of you GC gurus think?  Provided that my analysis is sound,
I can rapidly propose a patch to demonstrate this approach if there is
sufficient positive sentiment.

There is a bug open on sourceforge on this issue, so feel free to reply via
python-dev or via the bug -- I read both.  As usual sourceforge is buggered,
so I have not been able to update the bug with the contents of this e-mail.

  http://sourceforge.net/tracker/?func=detail&atid=105470&aid=572567&group_id=5470

Regards,
-Kevin

PS: I have not looked into why this doesn't happen in Python 2.2.x or
    before.  I suspect that it must be related to the recent GC changes in
    methodobject.py.  I'm not motivated to spend much time looking into
    this, because the current GC behavior is technically correct, though
    clearly sub-optimal.

--
Kevin Jacobs
The OPAL Group - Enterprise Systems Architect
Voice: (216) 986-0710 x 19         E-mail: jacobs@theopalgroup.com
Fax:   (216) 986-0714              WWW:    http://www.theopalgroup.com