[Python-Dev] The recursion checking problem

Sat Aug 30 22:06:00 CEST 2008

Hi,

I was working on a recursion overflow checking bug
(http://bugs.python.org/issue2548) and, while I've managed to produce a working
patch, I've also become uncomfortable with the very idea of trying to plug all
those holes just for the sake of plugging them. I'll try to explain why, by
describing the conflicting factors I've identified:

- more and more, we are adding calls to Py_EnterRecursiveCall() and
Py_LeaveRecursiveCall() all over the interpreter, to avoid special/obscure
cases of undetected infinite recursion; this can probably be considered a good
thing, but:

- after a recursion error has been raised (technically a RuntimeError), usually
some code has to do cleanup after noticing the exception; this cleanup now can
very easily bump into the recursion limit again, due to the point mentioned
above (the funniest example of this is PyErr_ExceptionMatches, which makes a
call to PyObject_IsSubclass which itself increases the recursion count because
__subclasscheck__ can be recursively invoked...).

- to counter the latter problem, py3k has introduced a somewhat smarter
mechanism (which I've tracked down to a commit in the defunct p3yk branch by
Martin): when the recursion limit is exceeded, a special flag named
"overflowed" is set in the thread state structure which disables the primary
recursion check, so that cleanup code has a bit of room to increase the
recursion count a bit. A secondary recursion check exists (equal to the primary
one /plus/ 50) and, if it is reached, the interpreter aborts with a fatal error.
The "overflowed" flag is cleared when the recursion count drops below the
primary recursion limit /minus/ 50. Now it looks rather smart but:

- unfortunately, some functions inside the interpreter discard every exception
by design. The primary example is PyDict_GetItem(), which is certainly used
quite a lot :-)... When PyDict_GetItem() returns NULL, the caller can only
assume that the key isn't in the dict, it has no way to know that there was a
critical problem due to a recursion overflow.

I encountered the latter problem when trying to backport the py3k recursion
overflow algorithm to trunk. A fatal error suddenly appeared in test_cpickle,
and it turned out that the recursion count was exceeded in
PyObject_RichCompare(), the error was then cleared in PyDict_GetItem(), but the
"overflowed" flag was still set so that a subsequent recursion overflow would
trigger the secondary check and lead to the fatal error.

I guess that, if it doesn't happen in py3k, it's just by chance: the recursion
overflow is probably happening at another point where errors don't get
discarded. Indeed, the failure I got on trunk was manifesting itself when
running "regrtest.py test_cpickle" but not directly "test_cpickle.py"... which
shows how delicate the recursion mechanism has become.

My attempt to solve the latter problem while still backporting the py3k scheme
involves clearing the "overflowed" flag in PyErr_Clear(). This makes all tests
pass ok, but also means the "overflowed" flag loses a lot of its meaning...
since PyErr_Clear() is called in a lot of places (and, especially, in
PyDict_GetItem()).

Also, at this point I fear that the solution to the problem is becoming,
because of its complexity, perhaps worse than the problem itself. That's why
I'm bringing it here, to have your opinion.

(I also suggest that we stop trying to fix recursion checking bugs until the
stable release, so as to give us some time to do the Right Thing later - if
there is such a thing)

Regards

Antoine.