[Python-bugs-list] [ python-Bugs-742911 ] Memory fault on complex weakref/weakkeydict delete

Sat, 31 May 2003 12:21:14 -0700

Bugs item #742911, was opened at 2003-05-24 18:29
Message generated for change (Comment added) made by mcfletch
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=742911&group_id=5470

Category: Python Interpreter Core
Group: Python 2.2.2
Status: Closed
Resolution: Fixed
Priority: 5
Submitted By: Mike C. Fletcher (mcfletch)
Assigned to: Guido van Rossum (gvanrossum)
Summary: Memory fault on complex weakref/weakkeydict delete

Initial Comment:
Attached find two modules which together form a
test-case.  The cache.py file is ripped out of a
production system (OpenGLContext), and I am seeing
memory faults under both Python 2.2.2 and 2.2.3 when I
run the code.  Under 2.2.2 while single-stepping
through the code I was able to provoke an error-message:

Fatal Python error: GC object already in linked list

The error message doesn't show up under 2.2.3, but the
memory-fault does.

Modules here don't use any extension modules, so there
shouldn't be any loose memory references or the like. 
Note, you'll likely need to make weakkeydictionary's
__delitem__ use keys instead of iterkeys to even get to
the crashing.

----------------------------------------------------------------------

>Comment By: Mike C. Fletcher (mcfletch)
Date: 2003-05-31 15:21

Message:
Logged In: YES 
user_id=34901

2.2.3 fixes the original problem on Win2k.

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2003-05-29 17:35

Message:
Logged In: YES 
user_id=33168

2.3 works for me with the original, but didn't before. 
Redhat 9.

----------------------------------------------------------------------

Comment By: Jeff Epler (jepler)
Date: 2003-05-29 16:47

Message:
Logged In: YES 
user_id=2772

Just one question: does the final test-case cover the
original bug?  I have a redhat 9 system with Python 2.2.2,
and temp2.py runs just fine:
$ python2 temp2.py 
<__main__.Oops object at 0x8162ecc>
$

The Python version is:
Python 2.2.2 (#1, Feb 24 2003, 19:13:11) 
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-4)] on linux2

I have not tried any other version of the test-case.  I also
didn't try a non-redhat version, so it's barely possible
that some patch they apply affected this issue.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-05-29 10:53

Message:
Logged In: YES 
user_id=31435

Just changed Resolution to Fixed.  Thanks everyone!

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2003-05-29 10:35

Message:
Logged In: YES 
user_id=6380

Thanks to Tim for the analysis, to Neal for the simplified
test, and to Mike for the bug report! The fix was actually
simple: clear the weak ref list *before* calling __del__ or
clearing __dict__. This is the same order as used by classic
classes, so can't be wrong. Applied to 2.2.3 branch and 2.3
trunk, with test case.

----------------------------------------------------------------------

Comment By: Mike C. Fletcher (mcfletch)
Date: 2003-05-28 21:16

Message:
Logged In: YES 
user_id=34901

To give timeline: This is from real-life code (PyOpenGL's
OpenGLContext demo/teaching module).  It's currently about
to go beta, with a release date target of end-of-summer. 
I've worked around the problem for the most common case w/in
the system (exiting from the VRML browser), but that
work-around's not usable for end-user long-running projects
until the users can deallocate the scenegraph to load
another (i.e. go from world to world within a single run of
the application).

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-05-28 19:26

Message:
Logged In: YES 
user_id=31435

That's cool.  The analysis is much easier to follow if you run 
Neal's little program, which crashes right away in a debug 
build.  Then the analysis just follows the call stack, and it's 
instantly obvious (if you know to look for it <wink>) that 
we're trying to deallocate a single object twice.

One thing subtype_delloc does that instance_dealloc 
doesn't is clear the instance dict before clearing the 
weakrefs.  Clearing the instance dict can end up executing 
anything, and in this case does, in particular materializing a 
strong reference to the object we're in the middle of 
deallocating (via a weakref that hasn't been cleared).  That 
seems to be the key point.

Mike found the problem in 2.2.2, I believe in his real-life 
OpenGL code.  That's why I'm keen to see it fixed for 2.2.3.

I'll be in the office Thursday, BTW.

Ah, here, I'll attach a simpler program that has the same 
kind of problem (temp2.py).

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2003-05-28 18:16

Message:
Logged In: YES 
user_id=6380

Tim, let's look at this when you're back in the office. My
head spins from just reading the analysis below.

Note that this is a 2.2 and 2.3 bug. I don't necessarily
want to hold up the 2.2.3 release until this is fixed,
unless we have a quick breakthrough.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-05-25 00:49

Message:
Logged In: YES 
user_id=31435

Assigned to Guido, because I suspect the problem lies in 
the order subtype_dealloc does things.

With reference to Neal's whittled-down case:  when 
makeSome() ends, we decref i and then decref item.  item's 
refcount hits 0 then.  There's a weakref remaining to item 
(in CacheHolder.client), but subtype_dealloc doesn't clear 
the weakref at this point.  First it clears item's instance 
dict.  That contains the last strong reference to i.  
subtype_dealloc is inovked again, and clears i's instance 
dict, and then deals with the weak reference to i.  The 
weakref to i has a callback associated with it, and 
CacheHolder.__call__() is invoked.  That invokes self.client
(), still a weakref to item, and because item's weakrefs still 
haven't been dealt with, self.client() returns item.

Now we're hosed.  item *had* a refcount of 0 at this point, 
and is still in the process of getting cleaned out by the first 
call to subtype_dealloc (it still thinks it's clearing item's 
instance dict).  We already called _Py_ForgetReference on 
item when its refcount fell to 0.  Its refcount gets boosted 
back to 1 by virtue of item getting returned by the 
self.client() weakref call.  Cleaning out the frame for 
CacheHolder.__call__() knocks the refcount down to 0 
again, and the second attempt to call _Py_ForgetReference 
on it blows up.

In a release build, nothing visibly bad happens when I try 
it.  It crashes if I add

        print client.items

at the end of __call__ in a release-build run, though.  Looks 
like that's just the luck of the draw after things have gone 
insane.

I note that instance_dealloc deals with weakrefs much 
earlier in the process, so that when Neal makes Items a 
classic class instead, the eventual call to self.client() 
returns None (instead of items), and nothing bad happens.. 

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-05-24 22:53

Message:
Logged In: YES 
user_id=31435

Outstanding, Neal -- thanks!  I can confirm that this crashes 
in a current 2.3 debug build on Windows too.  I'm feeling 
sick and won't pursue it now, though.  When cleaning up 
from the call to makeSome() (the body of makeSome() has 
completed, and we're cleaning up its execution frame, 
decref'ing the locals), we're dying in _Py_ForgetReference 
with a NULL-pointer derefernce.  The refcount on an Items 
object has just fallen to 0, and the code is trying to verify 
that the object is in the debug-build "list of all objects".  But 
its prev and next pointers are both NULL -- it's not in the 
list, and simply trying to check that it is blows up.

I don't have a theory for what's causing this, but it's 
probably not a good thing <heh>.

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2003-05-24 19:31

Message:
Logged In: YES 
user_id=33168

I cut out a lot of stuff from the test.  The new file is
much more minimal, but still crashes for me in a 2.3 debug
build.  You only need the one file too (not both files).

There is an issue with new style classes.  If Items doesn't
derive from object, I don't get a crash.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=742911&group_id=5470