https://www.python.org/dev/peps/pep-0556/

This PEP is currently Deferred as nobody is actively working on a test implementation. A situation came up the other day where I *believe* this could've helped.

Scenario (admittedly not one most environments run into): a Python process with a C++ extension module implementing a threaded server (threads spawned by C++) that could call back into CPython within server request handlers. (i.e. how all threaded servers tend to work, regardless of core loop implementation language)

Python code in the application had done something (unknown to me, I didn't dive into their code) that built up large enough, presumably nested or recursive, data structures that the garbage collector, when triggered, would wind up in very deep recursion. This caused a stack overflow, as the C++-spawned threads were only being given a 256k stack (to conserve virtual address space - there can potentially be a _ton_ of threads in this code).

That had a C++ stack trace 1000+ levels deep, repeating the pattern of:

...
@ 0x564d59bd21de 32 func_dealloc
@ 0x564d59bce0c1 32 cell_dealloc
@ 0x564d5839db41 48 tupledealloc
@ 0x564d59bd21de 32 func_dealloc
@ 0x564d59bce0c1 32 cell_dealloc
@ 0x564d5839db41 48 tupledealloc
...

If our gc were done on a thread of its own spawned by Python, with a typical larger default stack size (8MB), this would've been less likely to trigger a crash (though obviously still possible if the recursion is linear).

I know, there are obvious workarounds to this situation of all sorts. My point is more that synchronously triggering gc _within_ whatever thread happens to invoke the periodic "hey, let's do another gc run" logic in the eval loop is undesirable.

I'm not making any plans to work on an implementation for this PEP, so Deferred seems accurate. Just dropping a note that I'd still be interested in seeing something other than synchronous gc in arbitrary threads happen when a process has multiple threads.
*Another* takeaway from this is that it *appears* possible to cause our gc to go into linear recursion. Yuck! I'd file an issue on that, but doing so requires making some example code to construct such a scenario first...

Food for thought in case these kinds of things are something anyone else has encountered.

-Greg
[Gregory P. Smith]
> If our gc were done on a thread of its own spawned by Python, with a typical normal larger default stack size (8mb) this would've been less likely to trigger a crash (though obviously still possible if the recursion is linear).
Just noting that gc is probably a red herring in that scenario. gc (cleverly!) determines the set of trash objects without needing any recursive graph crawls. And gc doesn't deallocate anything - not directly. It simply applies Py_DECREF to trash objects, once for each PyObject* found pointing to them. _If_ deallocation occurs as a result, it's the same as if the user had done `del` on the appropriate object manually.

The recursion then occurs in the chain of XXX_dealloc calls, which in turn do more Py_DECREFs of their own, and which have nothing in particular to do with gc. Approximately the same stack depth would be hit in a Python built without gc if the user did the same sequence of `del`s by hand to break trash cycles.

The good news is that the "trashcan" mechanism - which predates gc - was introduced to limit call depth in the presence of some potentially unbounded chains of dealloc calls. So teach trashcan about the pattern here, and the stack depth problem goes away, with or without gc.

The bad news is that the trashcan mechanism is excruciating, a long-time source of subtle bugs of its own :-(

Note: without actual code to run, it's possible that trashcan is _already_ trying to limit stack depth in this specific case, but the stack size is too small to accommodate the largest depth of recursive calls trashcan is willing to allow before interfering. Or some bug snuck into trashcan, or its invoking code, that causes it to fail to work as intended due to some unknown detail of the code above. There's just no way to guess without code to provoke it.
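[To make the trashcan idea concrete, here's a small Python sketch of the technique - a hypothetical model, not CPython's actual C code, and all names except UNWIND_LEVEL's value are invented for illustration: a depth counter bounds the dealloc recursion, and anything deeper is parked on a list that gets drained once the stack has unwound.]

```python
# Hypothetical Python model of the trashcan technique (CPython's real
# version lives in C, behind the Py_TRASHCAN_SAFE_BEGIN/END macros).
UNWIND_LEVEL = 50      # CPython's PyTrash_UNWIND_LEVEL is also 50

nesting = 0            # models tstate->trash_delete_nesting
later = []             # models tstate->trash_delete_later
max_depth_seen = 0
dealloc_count = 0

class Node:
    """A 'container': deallocating it must also deallocate its child."""
    def __init__(self, child=None):
        self.child = child

def dealloc(node):
    global nesting, max_depth_seen, dealloc_count
    if nesting >= UNWIND_LEVEL:
        later.append(node)          # too deep: park it, dealloc later
        return
    nesting += 1
    max_depth_seen = max(max_depth_seen, nesting)
    dealloc_count += 1
    if node.child is not None:
        dealloc(node.child)         # models the recursive Py_DECREF
    node.child = None
    nesting -= 1
    if nesting == 0:
        while later:                # drain parked objects once unwound
            dealloc(later.pop())

# A 10_000-long chain "deallocates" with bounded depth:
head = None
for _ in range(10_000):
    head = Node(head)
dealloc(head)
print(max_depth_seen, dealloc_count)   # prints: 50 10000
```

The point of the model: the nesting counter never exceeds UNWIND_LEVEL no matter how long the chain is, while every node still gets deallocated exactly once.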
> *Another* takeaway from this is that it *appears* possible to cause our gc to go into linear recursion.
As above, it's probably not gc, but deallocation as a side effect of Py_DECREF dropping a refcount to 0. gc is just one way Py_DECREF can get invoked.
Good point, I hadn't considered that it was regular common refcount-0 dealloc chaining. The processes unfortunately didn't have faulthandler enabled, so there wasn't much info about where in the Python code it happened (now fixed). I'll see if anything looks particularly unusual next time I hear of such a report.

-G
On Wed, Mar 27, 2019, 5:38 PM Tim Peters wrote:
[Gregory P. Smith]
> Good point, I hadn't considered that it was regular common refcount-0 dealloc chaining.
It pretty much has to be whenever you see a chain of XXX_dealloc routines in a stack trace. gcmodule.c never even looks at a tp_dealloc slot directly, let alone directly invokes a deallocation method. That all happens indirectly, as a result of what Py_DECREF does.

Then once you're first inside one tp_dealloc method, gc is completely irrelevant - it's that tp_dealloc for the top-level container does its own Py_DECREF on a contained container, which in turn can do _its_ own Py_DECREF on one of its contained containers ... etc. You can get an arbitrarily deep stack of XXX_dealloc calls then, and there's really no other way to get that.

BTW, "container" here is used in a very broad C-level sense, not a high-level Python sense: any PyObject that contains a pointer to a PyObject is "a container" in the intended sense.
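[To see how a func_dealloc -> cell_dealloc -> tupledealloc chain like the one in the stack trace can arise from ordinary Python code, here's a hypothetical sketch - the shape is a guess, not the original app's code: each function's closure cell holds the previously created function, so dropping the last reference to the head walks the whole chain through exactly that trio of dealloc routines.]

```python
# Hypothetical reconstruction of a structure whose teardown recurses
# through func_dealloc -> (closure) tupledealloc -> cell_dealloc -> ...
def make_chain(n):
    f = None
    for _ in range(n):
        def outer(prev):
            def inner():
                return prev        # 'prev' lives in a closure cell
            return inner
        f = outer(f)               # each new function's cell holds the last
    return f

chain = make_chain(5000)
del chain  # the lone reference: dropping it triggers the dealloc chain
```

Dropping the head takes each function's refcount to zero in turn: func_dealloc releases the function's __closure__ tuple, tupledealloc releases the cell, and cell_dealloc Py_DECREFs the previous function. Since tupledealloc participates in the trashcan, a stock CPython with a normal-sized stack survives this.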
> The processes unfortunately didn't have faulthandler enabled so there wasn't much info from where in the python code it happened (now fixed).
It's quite possible that the top-level container was Py_DECREF'ed by code in gcmodule.c. But gc gets blamed at first for a lot of stuff that's not actually its fault ;-)
> I'll see if anything looks particularly unusual next time I hear of such a report.
The trashcan mechanism is the one and only hack in the code intended to stop unbounded XXX_dealloc stacks, so that's what needs looking at. Alas, it's hard to work with because it's so very low-level, and there's nothing portable that can be relied on about stack sizes or requirements across platforms or compilers.

Some possibilities:

- The trashcan code is buggy.

- The maximum container dealloc stack depth trashcan intends to allow (PyTrash_UNWIND_LEVEL = 50) is too large for the C stack a thread gets under this app on this platform using this compiler.

- One or more of the specific container types involved in this app's dealloc chain doesn't use the trashcan gimmick at all, so is invisible to trashcan's count of how deep the call stack has gotten. For example, cell_dealloc was in your stack trace, but I see no use of trashcan code in that function (Py_TRASHCAN_SAFE_BEGIN / Py_TRASHCAN_SAFE_END), so the trashcan hack has no idea that cell_dealloc calls are on the stack. And likewise for func_dealloc - looks like calls to that are also invisible to the trashcan. tupledealloc is cool, though.

IIRC, Christian Tismer introduced the trashcan because code only he wrote ;-) was blowing the stack when very deeply nested lists and/or tuples became trash.
From a quick scan of the current code, looks like it was later added to only a few types that aren't container types in the Python sense.
Which may or may not matter here. Your stack trace showed a tupledealloc in one of every three slots, so even if two of every three slots were invisible to the trashcan, the call stack "should have been" limited to a maximum of about PyTrash_UNWIND_LEVEL * 3 = 150 XXX_dealloc functions. But you saw a stack 1000+ levels deep. So something else that isn't immediately apparent is also going on.
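[That PyTrash_UNWIND_LEVEL * 3 arithmetic can be checked against a toy model - hypothetical Python, not CPython code - in which only every third "type" in a dealloc chain participates in the trashcan, mirroring trashcan-aware tupledealloc interleaved with trashcan-blind func_dealloc and cell_dealloc:]

```python
UNWIND_LEVEL = 50      # stand-in for PyTrash_UNWIND_LEVEL

nesting = 0            # trashcan counter: only "tuple" nodes touch it
later = []             # parked deallocations
frames = 0             # every dealloc call occupies one "C stack" frame
max_frames = 0
dealloc_count = 0

def dealloc(node):
    global nesting, frames, max_frames, dealloc_count
    kind, child = node
    if kind == "tuple" and nesting >= UNWIND_LEVEL:
        later.append(node)             # only tuples can be parked
        return
    if kind == "tuple":
        nesting += 1
    frames += 1                        # func/cell frames are invisible
    max_frames = max(max_frames, frames)
    dealloc_count += 1
    if child is not None:
        dealloc(child)
    frames -= 1
    if kind == "tuple":
        nesting -= 1
        if nesting == 0:
            while later:               # drain parked tuples once unwound
                dealloc(later.pop())

# func -> cell -> tuple -> func -> ... : 3000 triples, 9000 nodes
node = None
for _ in range(3000):
    node = ("func", ("cell", ("tuple", node)))
dealloc(node)
print(max_frames)   # roughly 3 * UNWIND_LEVEL frames, not UNWIND_LEVEL
```

The model caps the stack near 3 * UNWIND_LEVEL, matching the ~150-frame bound above - so it confirms that visibility gaps alone don't explain a 1000+ deep stack.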
On 2019-03-28 01:38, Tim Peters wrote:
> The bad news is that the trashcan mechanism is excruciating, a long-time source of subtle bugs of its own :-(
It just happens that I created a PR to fix some of the trashcan problems: see https://bugs.python.org/issue35983 and https://github.com/python/cpython/pull/11841
On Wed, 27 Mar 2019 15:59:25 -0700, "Gregory P. Smith" wrote:
> That had a C++ stack trace 1000+ levels deep repeating the pattern of
> [...]
As Tim said, if you still have a core dump somewhere (or can reproduce the issue) it would be nice to know why the "trashcan" mechanism didn't trigger.

Regards,

Antoine.
On 2019-03-28, Antoine Pitrou wrote:
> As Tim said, if you still have a core dump somewhere (or can reproduce the issue) it would be nice to know why the "trashcan" mechanism didn't trigger.
To expand on this: every time tupledealloc gets called, Py_TRASHCAN_SAFE_BEGIN also gets invoked, and it increments tstate->trash_delete_nesting. As Tim suggests, maybe PyTrash_UNWIND_LEVEL is too large given the size of the C stack frames from func_dealloc + cell_dealloc + tupledealloc. That theory seems hard to believe though, unless the C stack is quite small. I see PyTrash_UNWIND_LEVEL = 50.

Perhaps the stack could have been mostly used up before the dealloc sequence started. The other option is that there is some bug in the trashcan mechanism. It certainly is some very tricky code.

Regards,

Neil
participants (5):

- Antoine Pitrou
- Gregory P. Smith
- Jeroen Demeyer
- Neil Schemenauer
- Tim Peters