
When the Zope3 tests are run under Python 2.3, after the test runner ends we usually get treated to a long string of these things: """ Unhandled exception in thread started by Error in sys.excepthook: Original exception was: """ All that output is from Python, but it isn't particularly helpful <wink>. Here's what I know: For Zope readers: these are indeed due to the _Checkpoint threads that are left behind, sitting in self._event.wait(self._interval) self._interval is 120 seconds (i.e., a relatively long time). For Python readers: Event.wait() with a timeout ends up in _Condition.wait(), where a lazy busy loop wakes up about 20 times per second to see whether it can proceed. For some reason an exception is getting raised in the wait() code. I'm not exactly sure what or why yet, but that will come soon enough. For this message, it's enough to know that it happens (an educated guess comes later in this msg). Python is in Py_Finalize, in its (new in 2.3) second call of PyGC_Collect(): /* Collect garbage. This may call finalizers; it's nice to call these before all modules are destroyed. */ PyGC_Collect(); /* Destroy all modules */ PyImport_Cleanup(); /* Collect final garbage. This disposes of cycles created by new-style class definitions, for example. */ PyGC_Collect(); WE'RE IN THIS ONE *Presumably* the _Condition.wait() code wakes up after modules have been destroyed, and raises an exception due to referencing a None'ed-out threading.py global. A fact is that PyGC_Collect has called a __del__ method when the problem occurs, and that's why some other thread was able to get the GIL and proceed (gc.collect() gave control back to Python code, via a Python __del__ method). *Presumably* this only causes a visible problem when running "at least most of" the Zope3 tests because the Zope3 tests create a ton of containers and gc takes significant time then. Presumably if just a few tests are run, the second call to PyGC_Collect() goes so fast that it's unlikely the _Condition.wait() code will wake up and blow up before Python exits. Back to for-sure stuff: the thread does raise an exception, and we wind up back in threadmodule.c t_bootstrap: if (res == NULL) { IT IS NULL if (PyErr_ExceptionMatches(PyExc_SystemExit)) PyErr_Clear(); else { PyObject *file; PySys_WriteStderr( "Unhandled exception in thread started by "); file = PySys_GetObject("stderr"); if (file) PyFile_WriteObject(boot->func, file, 0); else PyObject_Print(boot->func, stderr, 0); PySys_WriteStderr("\n"); PyErr_PrintEx(0); } } It's not PyExc_SystemExit, so we write Unhandled exception in thread started by to C's stderr and then get sys.stderr. Heh. sys has been wiped by this point, and file gets is Py_None. Brrrrrrrr. Everything after this is a comedy of unintended errors. file is true (so far as C is concerned), so we do PyFile_WriteObject(boot->func, file, 0); That tries some tests that don't pass, then does writer = PyObject_GetAttrString(f, "write"); None.write doesn't exist, so the exception we started with is lost and replaced by an unwanted AttributeError. PyFile_WriteObject returns -1 then, but t_bootstrap doesn't check for that. Instead it writes a newline, and goes on to PyErr_PrintEx(0); That eventually does hook = PySys_GetObject("excepthook"); and for a now-familiar reason, that also returns None. Again hook looks true to C, and PyObject *result = PyEval_CallObject(hook, args); overwrites the unwanted AttributeError with a different unwanted exception. PyErr_PrintEx notices that the call fails, and prints "Error in sys.excepthook:\n" to C's stderr. It then tries PyErr_Display(exception2, v2, tb2); That starts with PyObject *f = PySys_GetObject("stderr"); and *again* that returns None from the ruined sys module, and again f looks true to C. It becomes something of a mess then, so suffice it to say that PyErr_Display's attempts to write to file None via PyFile_WriteString(buf, f) don't accomplish anything, and then PyErr_Display() ends with /* If an error happened here, don't show it. XXX This is wrong, but too many callers rely on this behavior. */ if (err != 0) PyErr_Clear(); So we come back with no exception set, and PyErr_PrintEx proceeds to write "\nOriginal exception was:\n" to C's stderr. Then it tries PyErr_Display(exception, v, tb); and that goes thru the long dance of retrieving None from sys.stderr (etc) all over again. The best thing for the Zope3 tests would be to find ways not to leave useless daemon threads hanging around. I'm not sure what it says about Python's shutdown strategy. If I find and fix a None-module-global problem in _Condition.wait(), that's fine and probably helpful (in 2.3.1), but in the context of Zope3 the run() methods of these threads are going to trigger None-module-global problems of their own. That can also be fixed. So a combination of changes to threading.py and Zope3 code can probably hide the shutdown glitches. Somehow, though, it just doesn't feel entirely satisfactory <0.9 wink>. seemingly-out-of-gordian-knot-cutters-for-this-week-ly y'rs - tim

"Tim Peters" wrote When the Zope3 tests are run under Python 2.3, after the test runner ends we usually get treated to a long string of these things:
""" Unhandled exception in thread started by Error in sys.excepthook: Original exception was:
"""
Yep. I remember mentioning these at the Melbourne sprint.
I'm not sure what it says about Python's shutdown strategy. If I find and fix a None-module-global problem in _Condition.wait(), that's fine and probably helpful (in 2.3.1), but in the context of Zope3 the run() methods of these threads are going to trigger None-module-global problems of their own. That can also be fixed.
Hm. Is this serious enough a problem to hold off 2.3.1 until it's fixed? My feeling is "no", because it only affects Z3 developers, and they're hardened to random pain by now, and it's also just a shutdown glitch. On the other hand, making the Z3 shutdown code os.kill(os.getpid(), 9) would also make the messages go away ;) Anthony -- Anthony Baxter <anthony@interlink.com.au> It's never too late to have a happy childhood.

On Fri, 2003-09-19 at 01:37, Anthony Baxter wrote:
Hm. Is this serious enough a problem to hold off 2.3.1 until it's fixed? My feeling is "no", because it only affects Z3 developers, and they're hardened to random pain by now, and it's also just a shutdown glitch.
I don't think it's worth waiting, because we don't have a plan for fixing it. Even if we did, it's arguable whether it belongs in a bug fix release -- at least not without extensive developer testing with the CVS head. Jeremy

Hello Tim, The behavior of Python regarding whether threads should continue to run after the main one exited has never been too clear. Wouldn't it make sense to try to control this aspect more precisely ? I have also experienced this kind of shutdown glitches. It definitely needs more thinking, but what about this one: Python would only shutdown after all threads are finished; but exiting the main thread would by default send a SystemExit exception to all current threads, if that can be done cleanly (oh well, this problem again). Another question, what was the rationale for setting module globals to None instead of simply deleting them (and getting the expected AttributeErrors both at C and at Python level) ? Armin

Armin Rigo wrote: ...
Another question, what was the rationale for setting module globals to None instead of simply deleting them (and getting the expected AttributeErrors both at C and at Python level) ?
Or, given cyclic GC, why isn't it enough to clear sys.modules? In theory, shouldn't that make all modules (and thier globals) collectable unless there are threads hanging on to them? Jim -- Jim Fulton mailto:jim@zope.com Python Powered! CTO (703) 361-1714 http://www.python.org Zope Corporation http://www.zope.com http://www.zope.org

Hello Jim, On Sun, Sep 21, 2003 at 06:06:45AM -0400, Jim Fulton wrote:
Or, given cyclic GC, why isn't it enough to clear sys.modules?
Because the modules dictionary is stored in PyThreadState_Get()->interp->modules, where the GC cannot see it, which prevent the dictionary from being freed; in turn, it prevents any module from being freed. Still, it means (at a first sight) that there is actually no need to clear each module's globals at all. Just removing them all from interp->modules is fine. In practice there are probably a few issues that I'm not fully aware of, like what occurs if we change module_dealloc() so that it no longer clears the module's global dictionary -- a tweak that seems to come from a pre-GC history. Not speaking about threads in particular, and compatibility problems apart, wouldn't it make sense to change sys.modules and interp->modules into a WeakValueDictionary (or some equivalent refcount-based hack like the one for interned strings)? This would give memory management at the module level, and would allow a cleaner shutdown procedure as well. It would however change Python's semantics of "a module is only loaded once unless you ask or you mess with sys.modules explicitely". A bientot, Armin.

At 04:27 PM 9/23/03 +0100, Armin Rigo wrote:
Not speaking about threads in particular, and compatibility problems apart, wouldn't it make sense to change sys.modules and interp->modules into a WeakValueDictionary (or some equivalent refcount-based hack like the one for interned strings)? This would give memory management at the module level, and would allow a cleaner shutdown procedure as well. It would however change Python's semantics of "a module is only loaded once unless you ask or you mess with sys.modules explicitely".
If I understand your suggestion correctly, the following code would break under such an arrangement: Foo.py: ====== x = 1 def getX(): return X Bar.py: ======= from Foo import getX # returns getX() bound to one set of globals import Foo # imports a *new* Foo module, with new dictionary! Foo.x = 2 # change the new Foo.x to 2 print getX() # call the getX() with the old globals, and print 1 So, unless a module's dictionary were to reference the module (or functions were to reference the module rather than (or in addition to) the module dictionary), it seems the proposed semantics would lead to unexpected results.

Hello Phillip, On Tue, Sep 23, 2003 at 12:09:18PM -0400, Phillip J. Eby wrote:
So, unless a module's dictionary were to reference the module (or functions were to reference the module rather than (or in addition to) the module dictionary), it seems the proposed semantics would lead to unexpected results.
Right. I'm not sure I understand the reasons behind the current module/globals relationship. As modules zap their globals with None when they are deallocated, we observe the following behavior: (foo.py) def g(): return 5 def f(): return g()
from foo import f import sys; del sys.modules['test4'] f() Traceback (most recent call last): File "<stdin>", line 1, in ? File "foo.py", line 4, in f return g() TypeError: 'NoneType' object is not callable
Possibly far-fetched, but I wouldn't be surprized to find large applications that mess with sys.modules in some way. For example, in one case, to ensure that a whole collection of interdependent modules will be reloaded on demand after I detect a change in one of them, I'm simply removing them all from sys.modules. Armin

At 05:49 PM 9/23/03 +0100, Armin Rigo wrote:
Hello Phillip,
On Tue, Sep 23, 2003 at 12:09:18PM -0400, Phillip J. Eby wrote:
So, unless a module's dictionary were to reference the module (or functions were to reference the module rather than (or in addition to) the module dictionary), it seems the proposed semantics would lead to unexpected results.
Right.
I'm not sure I understand the reasons behind the current module/globals relationship.As modules zap their globals with None when they are
Well, a function can have globals that aren't a module dictionary, so that's probably why functions don't refer to their modules directly. As for the None thing, it's to break circular references.
deallocated, we observe the following behavior:
(foo.py) def g(): return 5 def f(): return g()
from foo import f import sys; del sys.modules['test4'] f() Traceback (most recent call last): File "<stdin>", line 1, in ? File "foo.py", line 4, in f return g() TypeError: 'NoneType' object is not callable
Yeah, I've run into this (and weirder!) things before, when tracing down refcount bugs in C extensions. But, I don't think anybody in their right mind would *rely* on this behavior, so your suggestion of e.g. clearing sys.modules and letting GC do the rest seems worthy of investigation.
Possibly far-fetched, but I wouldn't be surprized to find large applications that mess with sys.modules in some way. For example, in one case, to ensure that a whole collection of interdependent modules will be reloaded on demand after I detect a change in one of them, I'm simply removing them all from sys.modules.
Of course, you have to also delete all modules and objects that reference the modules you want to delete. And that would still be the case under your "clear sys.modules at shutdown" proposal. OTOH, it's not clear what to do with uncollectable objects. E.g., if you have a module global that is or contains an object with a __del__ method, and is part of a cycle back to that module. I think you'd *still* be stuck going back and clearing the modules or setting their contents to None.

[Armin Rigo]
The behavior of Python regarding whether threads should continue to run after the main one exited has never been too clear. Wouldn't it make sense to try to control this aspect more precisely ?
threading.py already tried to. Whether or not a native platform thread outlives the main thread varies across platforms. A Thread created via threading.py prevents the main thread from exiting on all platforms so long as the Thread is running, unless the user calls its setDaemon(True) method before starting the thread. Whether a daemon Thread can outlive the main thread remains platform-dependent. These platform dependencies exist because there's no cross-platform way for Python to forcibly end a thread's life from outside the thread, or to force a thread to continue running after the main thread exits. Python *can* wait for a Thread to end on its own in a cross-platform way, and threading.py keeps the main thread alive until all non-daemon Threads have stopped running via that portable method. So it does what it reasonably *could* do.
I have also experienced this kind of shutdown glitches.
You mean the uselessly empty "Exception in thread" messages at shutdown? I've only seen those when running Zope3 tests so far (I count *non*-empty "Exception in thread" shutdown msgs as a different glitch, though). I expect every long-time Python programmer has seen the ever-popular "object 'None' has no attribute so-and-so" kinds of shutdown glitches, and we fix more of those every release by purging __del__ methods, and methods invoked by __del__ methods, of references to module globals. The segfaults at shutdown I saw later while trying to debug the Zope3 symptom were new to me.
It definitely needs more thinking, but what about this one: Python would only shutdown after all threads are finished;
In the Zope3 case, there are about a dozen Threads still running, and they'll never finish. Nine are in an Event.wait() that will never trigger; the rest are in an unbounded polling loop with a sleep(3) each time around. They're all daemon Threads, though, so asked Python explicitly not to wait for them to finish.
but exiting the main thread would by default send a SystemExit exception to all current threads, if that can be done cleanly (oh well, this problem again).
I'm afraid that would be a major change in shutdown semantics for non-daemon Threads, yes? Right now the main thread will happily wait as long as it takes for Threads to decide to stop on their own, and I've seen programs rely on that (fire up a bunch of producer and consumer Threads, then let the main thread fall off the end of the script). It might help the Zope3 problem, though, if SystemExit got raised in daemon Threads when the main thread wanted to exit. But stuffing a pending exception on a Thread's todo list is asynchronous, and in the case of the Zope3 Threads sitting in their sleep(3) polling loops, the SystemExit won't be noticed until the sleep expires. Since 3 might actually be bigger than 3 in some apps <wink>, there's really no limit on how long the main thread might have to wait to be sure all the daemon Threads noticed their SystemExit.
Another question, what was the rationale for setting module globals to None instead of simply deleting them (and getting the expected AttributeErrors both at C and at Python level) ?
I suspect it's because there's no "efficient" way to iterate over a dict and simultaneously mutate it in a size-changing way; the C PyDict_Next() protocol allows the caller to change the value associated with existing keys (like setting them to None), but is unpredictable if the caller adds or deletes keys during PyDict_Next() iteration. Well, due to quirks of the current implementation, I suspect it actually could delete all the keys without harm (the relevant quirk is that the current implementation never resizes a dict in response to a key deletion; it can resize only when a new key gets added; resizing can shuffle the keys around in a new order, which is what makes PyDict_Next unpredictable if the dict does get resized during iteration).

[Tim]
When the Zope3 tests are run under Python 2.3, after the test runner ends we usually get treated to a long string of these things:
""" Unhandled exception in thread started by Error in sys.excepthook: Original exception was:
"""
[bunch of analysis deleted]
... Event.wait() with a timeout ends up in _Condition.wait(), where a lazy busy loop wakes up about 20 times per second to see whether it can proceed.
For some reason an exception is getting raised in the wait() code. I'm not exactly sure what or why yet, but that will come soon enough.
It didn't. The primary effect of adding some vanilla debugging prints to threading.py's _Condition.wait() was to make Python die with segfaults at shutdown time instead. If Python's *second* call to PyGC_collect() in Py_Finalize() is commented out (the call that occurs after /* Destroy all modules */ PyImport_Cleanup(); ), all the problems go away, including the nonsense errors sprayed out at the end of Zope3 test runs. Recall that the nonsense errors are caused by a dozen stale daemon threads trying to execute Python code after the interpreter has been severely torn down, and they get the *chance* to do this because the second PyGC_collect() finds trash with Python __del__ methods (so PyGC_collect() loses the GIL when calling the __del__ methods, and all the daemon threads can proceed then). Note that this isn't a problem with code *in* __del__ methods! That makes it a different kind of shutdown glitch than we've usually wrestled with. It's a problem with Python code that has nothing to do with __del__; __del__'s only contribution is to release the GIL. Because the second PyGC_collect() *is* finding addtional finalizers to run, it's unattractive to stop calling it (getting more user-defined finalizers to run was the purpose of adding these PyGC_collect() shutdown calls to 2.3). OTOH, any __del__ method that runs after PyImport_Cleanup() will be (AFAICT) just as vulnerable to producing nonsense errors and segfaults as the code in _Condition.wait() has proven to be (sys is useless by that point, and all the Python internal code sucking basic objects out of sys isn't expecting to get None back). Maybe we should remove the second PyGC_collect() call before more apps run into these mysteries. Maybe we should delay tearing down sys as a special case (even more of a special case than it is now). Maybe the Zope3 tests should stop leaving an ever-growing number of daemon threads around (which appears to be the only solution so long as they're run under Python 2.3).
participants (7)
-
Anthony Baxter
-
Armin Rigo
-
Jeremy Hylton
-
Jim Fulton
-
Phillip J. Eby
-
Tim Peters
-
Tim Peters