[Python-Dev] Fun with 2.3 shutdown

Thu Sep 18 23:06:01 EDT 2003

When the Zope3 tests are run under Python 2.3, after the test runner ends we
usually get treated to a long string of these things:

"""
Unhandled exception in thread started by
Error in sys.excepthook:
Original exception was:

"""

All that output is from Python, but it isn't particularly helpful <wink>.

Here's what I know:

For Zope readers:  these are indeed due to the _Checkpoint threads that are
left behind, sitting in

            self._event.wait(self._interval)

self._interval is 120 seconds (i.e., a relatively long time).

For Python readers:

Event.wait() with a timeout ends up in _Condition.wait(), where a lazy busy
loop wakes up about 20 times per second to see whether it can proceed.

For some reason an exception is getting raised in the wait() code.  I'm not
exactly sure what or why yet, but that will come soon enough.  For this
message, it's enough to know that it happens (an educated guess comes later
in this msg).

Python is in Py_Finalize, in its (new in 2.3) second call of PyGC_Collect():

	/* Collect garbage.  This may call finalizers; it's nice to
         call these before all modules are destroyed. */
	PyGC_Collect();

	/* Destroy all modules */
	PyImport_Cleanup();

	/* Collect final garbage.  This disposes of cycles created by
	   new-style class definitions, for example. */
	PyGC_Collect();    WE'RE IN THIS ONE

*Presumably* the _Condition.wait() code wakes up after modules have been
destroyed, and raises an exception due to referencing a None'ed-out
threading.py global.  A fact is that PyGC_Collect has called a __del__
method when the problem occurs, and that's why some other thread was able to
get the GIL and proceed (gc.collect() gave control back to Python code, via
a Python __del__ method).

*Presumably* this only causes a visible problem when running "at least most
of" the Zope3 tests because the Zope3 tests create a ton of containers and
gc takes significant time then.  Presumably if just a few tests are run, the
second call to PyGC_Collect() goes so fast that it's unlikely the
_Condition.wait() code will wake up and blow up before Python exits.

Back to for-sure stuff:  the thread does raise an exception, and we wind up
back in threadmodule.c t_bootstrap:

	if (res == NULL) {   IT IS NULL
		if (PyErr_ExceptionMatches(PyExc_SystemExit))
			PyErr_Clear();
		else {
			PyObject *file;
			PySys_WriteStderr(
				"Unhandled exception in thread started by ");
			file = PySys_GetObject("stderr");
			if (file)
				PyFile_WriteObject(boot->func, file, 0);
			else
				PyObject_Print(boot->func, stderr, 0);
			PySys_WriteStderr("\n");
			PyErr_PrintEx(0);
		}
	}

It's not PyExc_SystemExit, so we write

    Unhandled exception in thread started by

to C's stderr and then get sys.stderr.  Heh.  sys has been wiped by this
point, and file gets is Py_None.  Brrrrrrrr.  Everything after this is a
comedy of unintended errors.  file is true (so far as C is concerned), so we
do

		PyFile_WriteObject(boot->func, file, 0);

That tries some tests that don't pass, then does

	writer = PyObject_GetAttrString(f, "write");

None.write doesn't exist, so the exception we started with is lost and
replaced by an unwanted AttributeError.

PyFile_WriteObject returns -1 then, but t_bootstrap doesn't check for that.
Instead it writes a newline, and goes on to

			PyErr_PrintEx(0);

That eventually does

	hook = PySys_GetObject("excepthook");

and for a now-familiar reason, that also returns None.  Again hook looks
true to C, and

		PyObject *result = PyEval_CallObject(hook, args);

overwrites the unwanted AttributeError with a different unwanted exception.
PyErr_PrintEx notices that the call fails, and prints

    "Error in sys.excepthook:\n"

to C's stderr.  It then tries

			PyErr_Display(exception2, v2, tb2);

That starts with

	PyObject *f = PySys_GetObject("stderr");

and *again* that returns None from the ruined sys module, and again f looks
true to C.  It becomes something of a mess then, so suffice it to say that
PyErr_Display's attempts to write to file None via

    PyFile_WriteString(buf, f)

don't accomplish anything, and then PyErr_Display() ends with

	/* If an error happened here, don't show it.
	   XXX This is wrong, but too many callers rely on this behavior. */
	if (err != 0)
		PyErr_Clear();

So we come back with no exception set, and PyErr_PrintEx proceeds to write

    "\nOriginal exception was:\n"

to C's stderr.  Then it tries

			PyErr_Display(exception, v, tb);

and that goes thru the long dance of retrieving None from sys.stderr (etc)
all over again.

The best thing for the Zope3 tests would be to find ways not to leave
useless daemon threads hanging around.

I'm not sure what it says about Python's shutdown strategy.  If I find and
fix a None-module-global problem in _Condition.wait(), that's fine and
probably helpful (in 2.3.1), but in the context of Zope3 the run() methods
of these threads are going to trigger None-module-global problems of their
own.  That can also be fixed.

So a combination of changes to threading.py and Zope3 code can probably hide
the shutdown glitches.  Somehow, though, it just doesn't feel entirely
satisfactory <0.9 wink>.

seemingly-out-of-gordian-knot-cutters-for-this-week-ly y'rs  - tim