Guido:
Well, it *also* abuses the ob_refcnt field.
My patch fixes that too (by abusing the gc_prev pointer to chain trash together).
How about this wild idea (which Tim & I had simultaneously yesterday): change the trashcan code to simply leave the object in the GC generational list, and let the GC code special-case objects with zero refcnt so that they are eventually properly disposed?
That could probably work. What happens when the GC is disabled? There is insidious bug here. Andrew helped me walk through it and I think we figured it out. First here's the code to trigger it: import gc class Ouch: def __del__(self): for x in range(20): list([]) def f(): gc.set_threshold(5) while 1: t = () # <-- here for i in range(300): t = [t, Ouch()] f() The line marked with "here" is where things go wrong. t used to refer to a long chain of [t, Ouch()]. The SETLOCAL macro in ceval calls Py_XDECREF(GETLOCAL(i)). That starts the deallocation of the list structure. Ouch.__del__ gets called can creates some more objects, triggering a collection. The f frame's traverse gets called and tries to follow the pointer for the t local. It points to memory that was freed by _PyTrash_destroy_chain. Hmm, now that I think about it the GC is not needed to trigger the bug: import gc gc.disable() import sys class Ouch: def __del__(self): print f_frame.f_locals['t'] def f(): global f_frame f_frame = sys._getframe() while 1: t = () for i in range(300): t = [t, Ouch()] f() I haven't figured out the correct solution yet. I'm just giving a status update. :-) Neil
[Neil Schemenauer] [mailto:nas@python.ca]
... There is insidious bug here. Andrew helped me walk through it and I think we figured it out. First here's the code to trigger it: ... t = () # <-- here ... The line marked with "here" is where things go wrong. t used to refer to a long chain of [t, Ouch()]. The SETLOCAL macro in ceval calls Py_XDECREF(GETLOCAL(i)). That starts the deallocation of the list structure. Ouch.__del__ gets called can creates some more objects, triggering a collection. The f frame's traverse gets called and tries to follow the pointer for the t local. It points to memory that was freed by _PyTrash_destroy_chain.
Yup. Guido & I bumped into that earlier in current CVS. The debug pymalloc was a great help in showing instantly that the memory had already been freed, and Guido immediately thought of ceval.c then. We each patched ceval.c locally; e.g., here's mine: #define SETLOCAL(i, value) do { PyObject *_t = GETLOCAL(i); \ GETLOCAL(i) = value; \ Py_XDECREF(_t); } while (0)
Hmm, now that I think about it the GC is not needed to trigger the bug:
True; this one is a very old bug, but one that simply never had a consequence before <wink>. Guido is seeing something stranger than I am now: on Windows, assorted test cases all pass now. On Linux, the list (but not the tuple) test is dying when releasing a lock(!). This under current CVS (and may be due to the new lock implementation on pthreads systems).
Tim Peters wrote:
We each patched ceval.c locally; e.g., here's mine:
#define SETLOCAL(i, value) do { PyObject *_t = GETLOCAL(i); \ GETLOCAL(i) = value; \ Py_XDECREF(_t); } while (0)
That's the same fix Andrew and I had in mind. My concern is that this is probably not the only bug of this type. The trashcan mechanism changes the ordering of object deallocation. What are the chances of other bugs like this lurking somewhere? Neil
#define SETLOCAL(i, value) do { PyObject *_t = GETLOCAL(i); \ GETLOCAL(i) = value; \ Py_XDECREF(_t); } while (0)
That's the same fix Andrew and I had in mind. My concern is that this is probably not the only bug of this type. The trashcan mechanism changes the ordering of object deallocation. What are the chances of other bugs like this lurking somewhere?
I've been aware of this issue for a long time (since Don Beaudry first pointed it out to me) and I've been pretty consistent in doing the right thing for globals and for other things that I knew would be accessible from outside. Too bad I missed this one, and you're right that there could be others, but I don't know how to find them systematically. Every DECREF call is suspect! Fixing the GC to only run at specific times isn't enough -- as you showed, you can exploit this by referencing the frame directly. The only safe solution is banning __del__, or moving the calls to __del__ to specific safe times (e.g. at the top of the VM switch). --Guido van Rossum (home page: http://www.python.org/~guido/)
[Neil Schemenauer]
That's the same fix Andrew and I had in mind. My concern is that this is probably not the only bug of this type. The trashcan mechanism changes the ordering of object deallocation. What are the chances of other bugs like this lurking somewhere?
I'll first expand on what "bug of this type" means: Any use of any C API gimmick G is a bug (discovered or not) if (sufficient but not necessary): 1. G can (directly or indirectly) call back into Python code; and 2A. The code using G leaves a reachable object in an inconsistent state; and/or 2B. The code using G assumes (and usually implicitly) that mutable portions of a reachable object won't change across its use of G. I've lost track of how many bugs of this sort we've fixed over the years. In an odd sense, it's a consequence of the GIL: a "free threaded" interpreter would get burned by "this kind of thing" all the time. The GIL restricts the possibilities for this kind of damage to specific call sites, but exactly *which* sites changes across releases. For a dramatic example, before cyclic GC was added, PyTuple_New was as harmless as water (too shallow to drown in, and not moving at lethal speed <wink>). Now it can lead to gc and destructor invocation, thread switches, and move list guts around in memory. As PyDict_Items learned the hard way, allocating a 2-tuple for a key/value pair can even resize the dict it's looking at. I'm certain there are more vulnerabilities in the core, but I bet we've fixed way more than half <wink>. The SETLOCAL bug is a #2A bug that's been there forever, yet even know that we know about it, I don't believe it's been the cause of any mystery-failure I ever heard of. So rather than asking what the odds are of other bugs like this (my guess is 99+% there's at least one more), I wonder what the odds are that anyone will write a program that cares. The odds of someone writing a program like Zope are too small to measure <wink>.
"NS" == Neil Schemenauer <nas@python.ca> writes:
NS> There is insidious bug here. Andrew helped me walk through it NS> and I think we figured it out. First here's the code to NS> trigger it: A simpler example: import gc class Ouch: def __del__(self): gc.collect() def f(): gc.set_threshold(5) while 1: t = () # <-- here for i in range(300): t = [t, Ouch()] f() -Barry
----------
Guido:
Well, it *also* abuses the ob_refcnt field.
Neil:
My patch fixes that too (by abusing the gc_prev pointer to chain trash together).
I think I haven't seen that patch yet.
How about this wild idea (which Tim & I had simultaneously yesterday): change the trashcan code to simply leave the object in the GC generational list, and let the GC code special-case objects with zero refcnt so that they are eventually properly disposed?
That could probably work. What happens when the GC is disabled?
The trashcan code is also disabled. Better not create cycles *or* deeply nested containers. They are similar anyway. :-)
There is insidious bug here. Andrew helped me walk through it and I think we figured it out. First here's the code to trigger it:
import gc
class Ouch: def __del__(self): for x in range(20): list([])
def f(): gc.set_threshold(5) while 1: t = () # <-- here for i in range(300): t = [t, Ouch()] f()
The line marked with "here" is where things go wrong. t used to refer to a long chain of [t, Ouch()]. The SETLOCAL macro in ceval calls Py_XDECREF(GETLOCAL(i)). That starts the deallocation of the list structure. Ouch.__del__ gets called can creates some more objects, triggering a collection. The f frame's traverse gets called and tries to follow the pointer for the t local. It points to memory that was freed by _PyTrash_destroy_chain.
Yes, Tim & I figured this out just before lunch, too. :-(
Hmm, now that I think about it the GC is not needed to trigger the bug:
import gc gc.disable() import sys
class Ouch: def __del__(self): print f_frame.f_locals['t']
def f(): global f_frame f_frame = sys._getframe() while 1: t = () for i in range(300): t = [t, Ouch()]
f()
I haven't figured out the correct solution yet. I'm just giving a status update. :-)
This is my patch: Index: ceval.c =================================================================== RCS file: /cvsroot/python/python/dist/src/Python/ceval.c,v retrieving revision 2.308 diff -c -r2.308 ceval.c *** ceval.c 24 Mar 2002 19:25:00 -0000 2.308 --- ceval.c 28 Mar 2002 18:21:09 -0000 *************** *** 554,561 **** /* Local variable macros */ #define GETLOCAL(i) (fastlocals[i]) ! #define SETLOCAL(i, value) do { Py_XDECREF(GETLOCAL(i)); \ ! GETLOCAL(i) = value; } while (0) /* Start of code */ --- 554,562 ---- /* Local variable macros */ #define GETLOCAL(i) (fastlocals[i]) ! #define SETLOCAL(i, value) do { PyObject *tmp = GETLOCAL(i); \ ! GETLOCAL(i) = value; \ ! Py_XDECREF(tmp); } while (0) /* Start of code */ --Guido van Rossum (home page: http://www.python.org/~guido/)
participants (4)
-
barry@zope.com -
Guido van Rossum -
Neil Schemenauer -
Tim Peters