[Python-bugs-list] [Bug #113812] Serious garbage collection problems with 2.0b1

noreply@sourceforge.net noreply@sourceforge.net
Thu, 14 Sep 2000 23:43:25 -0700


Bug #113812, was updated on 2000-Sep-07 10:10
Here is a current snapshot of the bug.

Project: Python
Category: Modules
Status: Open
Resolution: None
Bug Group: None
Priority: 8
Summary: Serious garbage collection problems with 2.0b1

Details: Since installing version 2.0b1, I have run into a few (serious) problems
with a non-trivial application (80,000+ lines of Python). It seems that
they are all related to the new garbage collection:

 - Suddenly, assertions fail because lists have become empty
   when they shouldn't have. It looks like their elements have been
   garbage collected while they were still reachable.

 - Under other circumstances, I get coredumps which always 
   seem to happen in function "move_root_reachable" of gcmodule.c:
   the "traverse" function pointer seems to contain a bogus address.

 - When I run a Purified version of the interpreter, I can't reproduce
   either problem, but instead, the garbage collector seems to get
   stuck in an endless loop each time. This always happens in the
   same function "move_root_reachable". Purify doesn't produce any 
   relevant warning.

The code runs just fine with version 1.6 of the interpreter, or when I
disable the garbage collector.

Probably relevant is the fact that the objects (tens of thousands) in my
application are very heavily cross-linked (nearly all links are
bi-directional), which probably puts a lot of stress on the garbage collector.

My platform: HP-UX 10.20 / c89 compiler


Follow-Ups:

Date: 2000-Sep-07 10:14
By: edg

Comment:
Just one more thing: when I turn on the gc debugging,
the interpreter also seems to get stuck in an endless loop.
-------------------------------------------------------

Date: 2000-Sep-07 14:41
By: jhylton

Comment:
If you set the GC threshold to 0, 
import gc
gc.set_threshold(0)

Do you get the same problem?  Not that I doubt there is some sort of gc problem, but I wonder if there is something going wrong in the accounting or in the collection.
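
For reference, a first threshold of 0 switches off automatic collection, so this test shows whether the failure depends on collections being triggered by allocation counts. A minimal sketch, assuming a current `gc` module; `run_without_automatic_gc` and `first_threshold` are hypothetical helpers for illustration, not part of the standard library:

```python
import gc

def run_without_automatic_gc(func):
    """Call func() with automatic collection switched off, then restore it."""
    saved = gc.get_threshold()       # remember the current thresholds
    gc.set_threshold(0)              # a first threshold of 0 disables automatic runs
    try:
        return func()
    finally:
        gc.set_threshold(*saved)     # put the previous thresholds back

def first_threshold():
    # Hypothetical workload: report the threshold in effect while it runs.
    return gc.get_threshold()[0]

seen = run_without_automatic_gc(first_threshold)
```

A collection can still be requested explicitly with gc.collect() while the threshold is 0, which helps separate "collection is wrong" from "collection is triggered at a bad moment".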

How hard would it be for someone to try to reproduce this bug?  Obviously, it would be helpful to get a smaller test case that has the same behavior as your large program and also tickles the bug.

Do you have any C extension modules in your application or is it pure Python?
-------------------------------------------------------

Date: 2000-Sep-07 15:06
By: jhylton

Comment:
Please do triage on this bug.
-------------------------------------------------------

Date: 2000-Sep-08 01:33
By: edg

Comment:
The code is 100% pure Python. 
When I set the threshold to 0, the problem doesn't occur.

The problem is probably very hard to reproduce by anyone else.
The slightest change in the input data for my application
can make the problem go away. Even running an optimized instead
of a debug version of the interpreter can make a difference.

I'm certainly going to try to strip it down, but that won't be 
easy (the same application, using other input data, running 
6 times as long, runs just fine).
-------------------------------------------------------

Date: 2000-Sep-11 07:48
By: edg

Comment:
After 2 full days of debugging, I think that I finally found the 
cause for the gc problems.

To keep a (very) long story short, what I found out was the following:

 - Crashes were due to objects that were destructed twice.
 
 - Endless loops were due to messed-up gc generation lists. The lists should
   always remain perfectly circular, but sometimes they ended up like this:

   list <-> ... <-> X <-> ... <-> ... -> X   

   No wonder that the gc code could easily get stuck; even turning on
   gc debugging caused the counting code to run in circles forever.
   
   The multiple-destruction problem is almost certainly caused by lists being
   messed up.
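
   The endless loop can be illustrated with a toy model. This is a sketch only, assuming a circular doubly linked list with a sentinel head; the names `Link`, `append` and `count` are hypothetical and are not taken from gcmodule.c. The `limit` argument exists only so that the example terminates:

```python
class Link:
    """One entry of a circular doubly linked list (stand-in for a gc list node)."""
    def __init__(self, name):
        self.name = name
        self.prev = self
        self.next = self

def append(head, node):
    """Insert node just before head, i.e. at the tail of the circular list."""
    node.prev = head.prev
    node.next = head
    head.prev.next = node
    head.prev = node

def count(head, limit=1000):
    """Count entries by walking forward; give up after `limit` steps."""
    n = 0
    node = head.next
    while node is not head:
        n += 1
        if n > limit:
            return None      # never came back to head: the list is not circular
        node = node.next
    return n

head = Link("list")
a, x, b = Link("A"), Link("X"), Link("B")
for node in (a, x, b):
    append(head, node)
healthy = count(head)        # walk returns to head after three entries

b.next = x                   # simulate the damage: the tail points back to X
damaged = count(head)        # walk cycles X -> B -> X and never reaches head
```

   Without the step limit, the second call to count() would never return, which matches the behaviour of the counting code described above.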
   
By turning on the debugging code in gc_list_remove and reducing the gc
threshold to a very small value, I could trigger the crash more
reliably, which allowed me to strip down my 80000 line application to this:

-----------------------------------------------------------------------------
#
# Note: to trigger a crash reliably, the debugging code in gc_list_remove
#       _must_ be turned on.
#
import gc
gc.set_threshold(1)

class Node:
   
   def __del__(self):
      dir(self)
 
a = Node()
del a # -> Crash
-----------------------------------------------------------------------------

You wonder: can it be that simple ? :-)

This is what happens when the Node instance `a' is destructed:

 1) The Node instance is removed from the gc lists.
 2) An instance method is created due to the call of the __del__ method.
    THAT METHOD CREATES A NEW REFERENCE TO THE INSTANCE !
 3) The code in the __del__ method triggers the allocation of new
    objects and because the gc threshold has been set very low, it also
    triggers a gc run. 
 4) During the gc run, the instance method is encountered and its reachable
    objects are visited.
 5) Since the instance is referenced by the method, the gc code tries to move
    the instance to another list, while it was no longer present in any list
    -> BINGO
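
Steps 2 and 4 can be checked from Python. This is a hedged sketch written for a current CPython, where a bound method exposes its instance as `__self__` (the 2.0-era attribute name was different) and `gc.get_referents()` reports what an object's traverse function visits:

```python
import gc

class Node:
    def __del__(self):
        pass

n = Node()
method = n.__del__                    # looking up __del__ builds a bound method
holds_instance = method.__self__ is n

# The collector follows the same references that the traverse function visits.
visited_by_traverse = any(obj is n for obj in gc.get_referents(method))
```

Both values come out true: the method object keeps a reference to the instance, and a collector traversing the method will reach the instance.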

Obviously, the reason why this problem was so hard to reproduce is that
most classes don't have a __del__ method, and the problem only occurs
when a gc run happens during the execution of a __del__ method.

I think the fix is as simple as this (I'm not too confident, but it
seems to work):
------------------------------------------------------------------------------
*** Objects/classobject.c.orig	Mon Sep 11 15:55:03 2000
--- Objects/classobject.c	Mon Sep 11 16:12:26 2000
***************
*** 490,496 ****
  #ifdef Py_TRACE_REFS
  	extern long _Py_RefTotal;
  #endif
- 	PyObject_GC_Fini(inst);
  	/* Call the __del__ method if it exists.  First temporarily
  	   revive the object and save the current exception, if any. */
  #ifdef Py_TRACE_REFS
--- 490,495 ----
***************
*** 523,529 ****
  #ifdef COUNT_ALLOCS
  		inst->ob_type->tp_free--;
  #endif
- 		PyObject_GC_Init((PyObject *)inst);
  		return; /* __del__ added a reference; don't delete now */
  	}
  #ifdef Py_TRACE_REFS
--- 522,527 ----
***************
*** 537,542 ****
--- 535,541 ----
  #endif /* Py_TRACE_REFS */
  	Py_DECREF(inst->in_class);
  	Py_XDECREF(inst->in_dict);
+ 	PyObject_GC_Fini(inst);
  	inst = (PyInstanceObject *) PyObject_AS_GC(inst);
  	PyObject_DEL(inst);
  }
------------------------------------------------------------------------------

i.e., delay the removal from the gc list until everything has stabilized.
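
The effect of the ordering can be sketched with a toy model. This is not the interpreter's code: `Entry`, `link`, `unlink`, `move` and `destroy` are hypothetical names, and the model assumes that the debugging code clears a node's links on removal, which is represented here by setting them to None. A collection running inside __del__ is modelled as a `move` of the instance to another list:

```python
class Entry:
    def __init__(self, name):
        self.name = name
        self.prev = self.next = self

def link(head, node):
    """Append node at the tail of the circular list rooted at head."""
    node.prev, node.next = head.prev, head
    head.prev.next = node
    head.prev = node

def unlink(node):
    """Remove node from its list and clear its links (assumed debug behaviour)."""
    node.prev.next = node.next
    node.next.prev = node.prev
    node.prev = node.next = None

def move(node, target):
    """What a collection does to a reachable entry: unlink it, then relink it."""
    unlink(node)
    link(target, node)

def destroy(node, old, unlink_first):
    if unlink_first:
        unlink(node)         # original order: off the list before __del__ runs
    move(node, old)          # a collection runs while __del__ executes
    if not unlink_first:
        unlink(node)         # patched order: off the list once things are stable

def try_destroy(unlink_first):
    young, old = Entry("young"), Entry("old")
    node = Entry("instance")
    link(young, node)
    try:
        destroy(node, old, unlink_first)
    except AttributeError:   # Python analogue of following a cleared pointer
        return "crash"
    return "ok"

original_order = try_destroy(unlink_first=True)
delayed_order = try_destroy(unlink_first=False)
```

In this model the original order fails as soon as the collection touches the already-removed entry, while the delayed order leaves both lists consistent.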

I hope this helps.


-------------------------------------------------------

Date: 2000-Sep-14 23:43
By: nascheme

Comment:
Your analysis looks correct.  Great work.  I think there is a small problem with your fix, however.  You should call PyObject_GC_Fini() before the DECREFs on in_class and in_dict.  If, for some reason, decrementing the reference counts of these objects triggers a garbage collection, then the instance would still be on the gc lists with an invalid in_class or in_dict pointer.
-------------------------------------------------------

For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=113812&group_id=5470