Evil reference cycles caused by Exception.__traceback__
Hi,

Python 3 added a __traceback__ attribute to exception objects. I guess that it was added to be able to get the original traceback when an exception is re-raised. Artificial example (real code is more complex, with subfunctions and conditional code):

    try:
        ...
    except Exception as exc:
        ...
        raise exc  # keep the original traceback

The problem is that Exception.__traceback__ creates reference cycles. If you store an exception in a local variable, suddenly all local variables of all frames (of the current traceback) become part of a giant reference cycle. Sometimes all these variables are kept alive for a very long time (until you exit Python?), at least much longer than their "expected" lifetime (local variables are "magically" destroyed when you exit the function).

The reference cycle is:

1) frame -> local variable -> variable which stores the exception
2) exception -> traceback -> frame

so: exception -> ... -> frame -> ... -> the same exception.

Breaking the reference cycle manually is complex. First, you must be aware of the reference cycle! Second, you have to identify which functions of your large application create the reference cycle: this task is long and painful, even with good tooling. Finally, you have to explicitly clear variables or attributes to break the reference cycle manually.

asyncio.Future.set_exception() keeps an exception object and its traceback object alive: asyncio creates reference cycles *by design*. Enjoy! asyncio tries hard to reduce the consequences of reference cycles, or even to break cycles, using hacks like "self = None" in methods... Setting self to None is really surprising and requires a comment explaining the hack.

In recent years, I have fixed many reference cycles in various parts of the Python 3 standard library. Sometimes it takes years to become aware of a reference cycle and finally fix it.
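The two halves of the cycle described above can be observed directly from inside an exception handler; a minimal CPython sketch (the function name is made up for illustration):

```python
import sys

def cycle_demo():
    try:
        raise ValueError("boom")
    except ValueError as exc:
        err = exc  # storing the exception in a local closes the cycle
        # 2) exception -> traceback -> frame: the traceback chain leads
        # right back to the frame that is handling the exception...
        frame = err.__traceback__.tb_frame
        same_frame = frame is sys._getframe()
        # 1) frame -> local variable -> exception: ...and that frame's
        # locals hold the exception again.
        holds_exc = frame.f_locals["err"] is err
        return same_frame, holds_exc

print(cycle_demo())  # → (True, True)
```

Both checks pass: the frame and the exception each keep the other alive, so everything else in the frame's locals is dragged along.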
For example, recently I worked on fixing all "dangling threads" leaked by tests of the Python test suite, and I found and fixed many reference cycles which had probably existed since Python 3 was created (forked from Python 2):

* socket.create_connection(): commit acb9fa79fa6453c2bbe3ccfc9cad2837feb90093, bpo-31234
* concurrent.futures.ThreadPoolExecutor: commit bc61315377056fe362b744d9c44e17cd3178ce54, bpo-31249
* pydoc: commit 4cab2cd0c05fcda5fcb128c9eb230253fff88c21, bpo-31238
* xmlrpc.server: commit 84524454d0ba77d299741c47bd0f5841ac3f66ce, bpo-31247

Other examples:

* test_ssl: commit 868710158910fa38e285ce0e6d50026e1d0b2a8c, bpo-31323
* test_threading: commit 3d284c081fc3042036adfe1bf2ce92c34d743b0b, bpo-31234

Another example of a recent change fixing a reference cycle, by Antoine Pitrou:

* multiprocessing: commit 79d37ae979a65ada0b2ac820279ccc3b1cd41ba6, bpo-30775

For socket.create_connection(), I discovered the reference cycle because a test started to log a warning about a dangling thread. The warning was introduced indirectly by a change which modified the support.HOST value from '127.0.0.1' to 'localhost'... It's hard to see the link between the support.HOST value and a reference cycle. Full story: https://bugs.python.org/issue29639#msg302087

Again, it's just yet another random example of a very tricky reference cycle bug caused by Exception.__traceback__.

Ideally, CPython 3.x should never create reference cycles. Removing Exception.__traceback__ is the obvious "fix" for the issue. But I expect that slowly a lot of code has started to rely on the attribute, maybe even for good reasons :-)

A more practical solution would be to log a warning. Maybe the garbage collector could emit a warning if it detects an exception that is part of a reference cycle? Or maybe detect frames?
If the GC cannot do it, maybe we could use a debug thread (enabled manually) which periodically checks whether an exception is part of a reference cycle using gc.get_objects(): check if an exception remains alive longer than X seconds. I had the same idea for asyncio, to detect reference cycles or tasks that are never "awaited", but I never implemented it.

Victor
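The gc.get_objects() half of that idea fits in a few lines; a sketch, with a hypothetical helper name (a real debug thread would run this periodically and warn about exceptions older than some limit):

```python
import gc

def live_exceptions():
    # Scan the GC-tracked heap for exception instances that are still
    # alive.  gc.get_objects() returns every object the cycle collector
    # knows about, which includes all exception instances.
    return [obj for obj in gc.get_objects()
            if isinstance(obj, BaseException)]

try:
    raise RuntimeError("kept alive")
except RuntimeError as exc:
    kept = exc  # this outer reference keeps the exception alive

# The stored exception shows up in the scan:
print(any(obj is kept for obj in live_exceptions()))  # → True
```

Tracking *how long* each exception has been alive would still require remembering ids across successive scans, which is where it gets more involved.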
On Mon, 18 Sep 2017 11:31:12 +0200 Victor Stinner <victor.stinner@gmail.com> wrote:
Ideally, CPython 3.x should never create reference cycles. Removing Exception.__traceback__ is the obvious "fix" for the issue. But I expect that slowly, a lot of code started to rely on the attribute, maybe even for good reasons :-)
The real issue is not reference cycles, but the fact that a traceback frame keeps all locals alive. When the frame's execution is finished, that information is almost always useless, but it is kept for the rare cases of debugging (and perhaps dubious features such as some of py.test's magic). So we're constantly paying the price of this minor, albeit probably useful (*), debugging-related feature with annoying object lifetime issues that are sometimes quite difficult to track down and solve.

(*) I've hardly ever used pdb so I don't know if people often examine the locals of finished frames, but I suppose some do.

Perhaps it would be useful to have a way to tell the interpreter to automatically clear all locals on frames when they have finished executing. It could be a thread-local setting (or, better, a PEP 550 context variable setting).

Regards

Antoine.
On 18 September 2017 at 19:48, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 18 Sep 2017 11:31:12 +0200 Victor Stinner <victor.stinner@gmail.com> wrote:
Ideally, CPython 3.x should never create reference cycles. Removing Exception.__traceback__ is the obvious "fix" for the issue. But I expect that slowly, a lot of code started to rely on the attribute, maybe even for good reasons :-)
The real issue is not reference cycles, but the fact that a traceback frame keeps all locals alive. When the frame's execution is finished, that information is practically almost always useless, but is kept for the rare cases of debugging (and perhaps dubious features such as some of py.test's magic).
As another use case, IPython tracebacks will include details of referenced local variables, and https://docs.python.org/3/library/traceback.html#tracebackexception-objects offers the ability to readily capture the repr() of locals referenced from a traceback for the same reason.
So we're constantly paying the price of this minor, albeit probably useful (*), debugging-related feature with annoying object lifetime issues that are sometimes quite difficult to track down and solve.
(*) (I've hardly ever used pdb so I don't know if people often examine the locals of finished frames, but I suppose some do)
Perhaps it would be useful to have a way to tell the interpreter to automatically clear all locals on frames when they have finished executing. It could be a thread-local setting (or, better, a PEP 550 context variable setting).
I wonder if it might be reasonable to have tracebacks only hold a weak reference to their frame objects when "__debug__ == False". Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
2017-09-18 12:07 GMT+02:00 Nick Coghlan <ncoghlan@gmail.com>:
I wonder if it might be reasonable to have tracebacks only hold a weak reference to their frame objects when "__debug__ == False".
Please don't change the Python behaviour depending on __debug__, or it will be a nightmare to debug it :-( ("Why does it work on my computer?") Victor
On 18 September 2017 at 20:18, Victor Stinner <victor.stinner@gmail.com> wrote:
2017-09-18 12:07 GMT+02:00 Nick Coghlan <ncoghlan@gmail.com>:
I wonder if it might be reasonable to have tracebacks only hold a weak reference to their frame objects when "__debug__ == False".
Please don't change the Python behaviour depending on __debug__, or it will be a nightmare to debug it :-( ("Why does it work on my computer?")
Yeah, that's always a challenge :)

Rather than being thread-local or context-local state, whether or not to keep the full frames in the traceback could be yet another setting on the *exception* object, whereby we tweak the logic that drops the reference at the end of an except clause as follows:

1. If the exception refcount == 1, just drop the reference and let the traceback die with it
2. If the exception refcount > 1, check for a new "__keep_locals__" attribute
3. If it's true, just drop the reference and otherwise leave the exception alone
4. If it's false, recursively clear all the already terminated frames in __traceback__, __context__, and __cause__

For now, we'd default to "__keep_locals__ = True", but we'd attempt to figure out some way to migrate to the default being "__keep_locals__ = False".

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
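Step 4 can be sketched in pure Python with today's frame.clear() API (this is a hypothetical helper, not interpreter code; "__keep_locals__" does not exist):

```python
def clear_finished_frames(exc, _seen=None):
    # Recursively clear already terminated frames reachable through
    # __traceback__, __context__ and __cause__.  frame.clear() raises
    # RuntimeError for a frame that is still executing, so those frames
    # are skipped.
    if exc is None:
        return
    if _seen is None:
        _seen = set()
    if id(exc) in _seen:
        return
    _seen.add(id(exc))
    tb = exc.__traceback__
    while tb is not None:
        try:
            tb.tb_frame.clear()  # drop the frame's local variables
        except RuntimeError:
            pass                 # frame still executing: leave it alone
        tb = tb.tb_next
    clear_finished_frames(exc.__context__, _seen)
    clear_finished_frames(exc.__cause__, _seen)
```

Note the RuntimeError branch: it is exactly the "already terminated frames" restriction from step 4, which matters for the follow-up discussion.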
On Mon, 18 Sep 2017 20:35:02 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
On 18 September 2017 at 20:18, Victor Stinner <victor.stinner@gmail.com> wrote:
2017-09-18 12:07 GMT+02:00 Nick Coghlan <ncoghlan@gmail.com>:
I wonder if it might be reasonable to have tracebacks only hold a weak reference to their frame objects when "__debug__ == False".
Please don't change the Python behaviour depending on __debug__, or it will be a nightmare to debug it :-( ("Why does it work on my computer?")
Yeah, that's always a challenge :)
Rather than being thread local or context local state, whether or not to keep the full frames in the traceback could be a yet another setting on the *exception* object, whereby we tweaked the logic that drops the reference at the end of an except clause as follows:
That doesn't solve the problem, since the issue is that exceptions can be raised (and then silenced) in many places, and you don't want such exception-raising code (such as socket.create_connection) to start having to set an option on the exceptions it raises. Regards Antoine.
On 18 September 2017 at 20:52, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 18 Sep 2017 20:35:02 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
Rather than being thread local or context local state, whether or not to keep the full frames in the traceback could be a yet another setting on the *exception* object, whereby we tweaked the logic that drops the reference at the end of an except clause as follows:
That doesn't solve the problem, since the issue is that exceptions can be raised (and then silenced) in many places, and you don't want such exception-raising code (such as socket.create_connection) to start having to set an option on the exceptions it raises.
Bleh, in trying to explain why my proposal would be sufficient to break the problematic cycles, I realised I was wrong: if we restrict the frame clearing to already terminated frames (as would be necessary to avoid breaking any still-executing functions), then that means we won't clear the frame running the exception handler, and that's the frame that creates the problematic cyclic reference.

However, I still think it makes more sense to focus on the semantics of preserving an exception beyond the life of the stack being unwound, rather than on the semantics of raising the exception in the first place.

In the usual case, the traceback does keep the whole stack alive while the stack is being unwound, but then the exception gets thrown away at the end when sys.exc_info() gets reset back to (None, None, None), and then all the frames still get cleaned up fairly promptly (this is also the case in Python 2). We only get problems when one of the exception handlers in the stack grabs an additional reference to either the traceback or the exception and hence creates a persistent cyclic reference from one of the frames back to itself. The difference in Python 3 is that saving the exception is reasonably common, while explicitly saving the traceback is relatively rare, so "exc.__traceback__" keeps tracebacks alive that would otherwise have been cleaned up more deterministically.

Putting the problem that way gave me an idea, though: what if, when the interpreter was setting sys.exc_info() back to (None, None, None) (or otherwise dropping an exception instance from being the "currently active exception"), it automatically set exc.__traceback__ to None? That way, if you wanted the *traceback* (rather than just the exception) to live beyond the stack being unwound, you'd have to preserve the entire sys.exc_info() triple (or at least save "exc.__traceback__" separately from "exc").
By doing that, we'd have the opportunity to encourage folks who are considering preserving the entire traceback to extract a TracebackException instead, and save themselves from some potentially nasty reference management issues: https://docs.python.org/3/library/traceback.html#tracebackexception-objects

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
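The TracebackException approach mentioned above looks like this in practice (the capture() wrapper is just for illustration):

```python
import traceback

def capture():
    try:
        1 / 0
    except ZeroDivisionError as exc:
        # from_exception() walks the traceback once and stores only
        # strings (here including reprs of the locals), so no frame
        # objects -- and none of their locals -- are kept alive.
        return traceback.TracebackException.from_exception(
            exc, capture_locals=True)

summary = capture()
print("".join(summary.format()))
```

The rendered text is the same as a normal traceback, but nothing in it participates in a reference cycle.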
On Mon, 18 Sep 2017 20:07:42 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
On 18 September 2017 at 19:48, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 18 Sep 2017 11:31:12 +0200 Victor Stinner <victor.stinner@gmail.com> wrote:
Ideally, CPython 3.x should never create reference cycles. Removing Exception.__traceback__ is the obvious "fix" for the issue. But I expect that slowly, a lot of code started to rely on the attribute, maybe even for good reasons :-)
The real issue is not reference cycles, but the fact that a traceback frame keeps all locals alive. When the frame's execution is finished, that information is practically almost always useless, but is kept for the rare cases of debugging (and perhaps dubious features such as some of py.test's magic).
As another use case, IPython tracebacks will include details of referenced local variables, and https://docs.python.org/3/library/traceback.html#tracebackexception-objects offers the ability to readily capture the repr() of locals referenced from a traceback for the same reason.
True... but interactive use has different concerns than production use (hence my proposition of a setting to change frame cleanup behaviour). Besides, IPython also stores the history of displayed values, which is another cause of keeping objects alive :-) Regards Antoine.
On Mon, Sep 18, 2017 at 2:48 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 18 Sep 2017 11:31:12 +0200 Victor Stinner <victor.stinner@gmail.com> wrote:
Ideally, CPython 3.x should never create reference cycles. Removing Exception.__traceback__ is the obvious "fix" for the issue. But I expect that slowly, a lot of code started to rely on the attribute, maybe even for good reasons :-)
The real issue is not reference cycles, but the fact that a traceback frame keeps all locals alive. When the frame's execution is finished, that information is practically almost always useless, but is kept for the rare cases of debugging (and perhaps dubious features such as some of py.test's magic). So we're constantly paying the price of this minor, albeit probably useful (*), debugging-related feature with annoying object lifetime issues that are sometimes quite difficult to track down and solve.
(*) (I've hardly ever used pdb so I don't know if people often examine the locals of finished frames, but I suppose some do)
Yes, that's what `pdb.pm()` is for.
Perhaps it would be useful to have a way to tell the interpreter to automatically clear all locals on frames when they have finished executing. It could be a thread-local setting (or, better, a PEP 550 context variable setting).
In Python 2 the traceback was not part of the exception object because there was (originally) no cycle GC. In Python 3 we changed the awkward interface to something more useful, because we could depend on GC. Why are we now trying to roll back this feature? We should just improve GC. (Or perhaps you shouldn't be raising so many exceptions. :-)

-- --Guido van Rossum (python.org/~guido)
On 18/09/2017 at 16:52, Guido van Rossum wrote:
In Python 2 the traceback was not part of the exception object because there was (originally) no cycle GC. In Python 3 we changed the awkward interface to something more useful, because we could depend on GC. Why are we now trying to roll back this feature? We should just improve GC. (Or perhaps you shouldn't be raising so many exceptions. :-)
Improving the GC is obviously a good thing, but what heuristic would you have in mind that may solve the issue at hand? Regards Antoine.
On Sep 18, 2017 07:58, "Antoine Pitrou" <antoine@python.org> wrote:

On 18/09/2017 at 16:52, Guido van Rossum wrote:
In Python 2 the traceback was not part of the exception object because there was (originally) no cycle GC. In Python 3 we changed the awkward interface to something more useful, because we could depend on GC. Why are we now trying to roll back this feature? We should just improve GC. (Or perhaps you shouldn't be raising so many exceptions. :-)
Improving the GC is obviously a good thing, but what heuristic would you have in mind that may solve the issue at hand?

I read the whole thread and I'm not sure what the issue at hand is :-). Obviously it's nice when the refcount system is able to implicitly clean things up in a prompt and deterministic way, but there are already tools to handle the cases where it doesn't (ResourceWarning, context managers, ...), and the more we encourage people to implicitly rely on refcounting, the harder it is to optimize the interpreter or use alternative language implementations. Why are reference cycles a problem that needs solving?

Actually, the bit that I find most confusing about Victor's story is: how can a traceback frame keep a thread alive? Is the problem that "dangling thread" warnings are being triggered by threads that are finished and dead but whose Thread objects are still allocated? Because if so, then that seems like a bug in the warnings mechanism; there's no harm in a dead Thread hanging around until collected, and Victor may have wasted a day debugging an issue that wasn't a problem in the first place...

-n
On Mon, 18 Sep 2017 09:42:45 -0700 Nathaniel Smith <njs@pobox.com> wrote:
Obviously it's nice when the refcount system is able to implicitly clean things up in a prompt and deterministic way, but there are already tools to handle the cases where it doesn't (ResourceWarning, context managers, ...), and the more we encourage people to implicitly rely on refcounting, [...]
The thing is, we don't need to encourage them. Having objects disposed of when the last visible reference vanishes is a pretty common expectation people have when using CPython.
Why are reference cycles a problem that needs solving?
Because sometimes they are holding up costly resources in memory when people don't expect them to. Such as large Numpy arrays :-) And, no, there are no obvious ways to fix for users. gc.collect() is much too costly to be invoked on a regular basis.
Because if so then that seems like a bug in the warnings mechanism; there's no harm in a dead Thread hanging around until collected, and Victor may have wasted a day debugging an issue that wasn't a problem in the first place...
Yes, I think Victor is getting a bit overboard with the so-called "dangling thread" issue. But the underlying issue (that heavyweight resources can be inadvertently held up in memory up just because some unrelated exception was caught and silenced along the way) is a real one. Regards Antoine.
On Mon, Sep 18, 2017 at 9:50 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 18 Sep 2017 09:42:45 -0700 Nathaniel Smith <njs@pobox.com> wrote:
Obviously it's nice when the refcount system is able to implicitly clean things up in a prompt and deterministic way, but there are already tools to handle the cases where it doesn't (ResourceWarning, context managers, ...), and the more we encourage people to implicitly rely on refcounting, [...]
The thing is, we don't need to encourage them. Having objects disposed of when the last visible reference vanishes is a pretty common expectation people have when using CPython.
Why are reference cycles a problem that needs solving?
Because sometimes they are holding up costly resources in memory when people don't expect them to. Such as large Numpy arrays :-)
Do we have any reason to believe that this is actually happening on a regular basis though? If it is then it might make sense to look at the cycle collection heuristics; IIRC they're based on a fairly naive count of how many allocations have been made, without regard to their size.
And, no, there are no obvious ways to fix for users. gc.collect() is much too costly to be invoked on a regular basis.
Because if so then that seems like a bug in the warnings mechanism; there's no harm in a dead Thread hanging around until collected, and Victor may have wasted a day debugging an issue that wasn't a problem in the first place...
Yes, I think Victor is getting a bit overboard with the so-called "dangling thread" issue. But the underlying issue (that heavyweight resources can be inadvertently held up in memory up just because some unrelated exception was caught and silenced along the way) is a real one.
Simply catching and silencing exceptions doesn't create any loops -- if you do

    try:
        raise ValueError
    except ValueError as exc:
        raise exc

then there's no loop, because the 'exc' local gets cleared as soon as you exit the except: block. The issue that Victor ran into with socket.create_connection is a special case where that function saves off the caught exception to use later. If someone wanted to replace socket.create_connection's 'raise err' with

    try:
        raise err
    finally:
        del err

then I guess that would be pretty harmless...

-n

-- Nathaniel J. Smith -- https://vorpus.org
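The full pattern can be sketched as a self-contained function; this is a simplified illustration of the socket.create_connection() shape (a hypothetical helper, not the stdlib code):

```python
def connect_sketch(attempts):
    # The caught exception is saved across loop iterations and re-raised
    # at the end if every attempt failed.  The try/finally drops the
    # local reference after the re-raise, breaking the
    # frame -> err -> __traceback__ -> frame cycle.
    err = None
    for attempt in attempts:
        try:
            return attempt()
        except OSError as exc:
            err = exc
    if err is not None:
        try:
            raise err
        finally:
            err = None  # same effect as "del err"
    raise OSError("nothing to attempt")
```

Setting err = None in the finally clause is the form the actual bpo-31234 fix took, since the name must remain bound for the raise statement itself.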
On 18/09/2017 at 19:53, Nathaniel Smith wrote:
Why are reference cycles a problem that needs solving?
Because sometimes they are holding up costly resources in memory when people don't expect them to. Such as large Numpy arrays :-)
Do we have any reason to believe that this is actually happening on a regular basis though?
Define "regular" :-) We did get some reports on dask/distributed about it.
If it is then it might make sense to look at the cycle collection heuristics; IIRC they're based on a fairly naive count of how many allocations have been made, without regard to their size.
Yes... But just because a lot of memory has been allocated isn't a good enough heuristic to launch a GC collection. What if that memory is gonna stay allocated for a long time? Then you're frequently launching GC runs for no tangible result except more CPU consumption and frequent pauses.

Perhaps we could special-case tracebacks somehow: flag when a traceback remains alive after the implicit "del" clause at the end of an "except" block, then maintain some kind of linked list of the flagged tracebacks and launch specialized GC runs to find cycles across that collection. That sounds quite involved, though.
The issue that Victor ran into with socket.create_connection is a special case where that function saves off the caught exception to use later.
That's true... but you can find such special cases in an assorted bunch of library functions (stdlib or third-party), not just socket.create_connection(). Fixing them one by one is always possible, but it's a never-ending battle. Regards Antoine.
On Mon, Sep 18, 2017 at 10:59 AM, Antoine Pitrou <antoine@python.org> wrote:
On 18/09/2017 at 19:53, Nathaniel Smith wrote:
Why are reference cycles a problem that needs solving?
Because sometimes they are holding up costly resources in memory when people don't expect them to. Such as large Numpy arrays :-)
Do we have any reason to believe that this is actually happening on a regular basis though?
Define "regular" :-) We did get some reports on dask/distributed about it.
Caused by uncollected cycles involving tracebacks? I looked here: https://github.com/dask/distributed/issues?utf8=%E2%9C%93&q=is%3Aissue%20memory%20leak and saw some issues with cycles causing delayed collection (e.g. #956) or the classic memory leak problem of explicitly holding onto data you don't need any more (e.g. #1209, bpo-29861), but nothing involving traceback cycles. It was just a quick skim though.
If it is then it might make sense to look at the cycle collection heuristics; IIRC they're based on a fairly naive count of how many allocations have been made, without regard to their size.
Yes... But just because a lot of memory has been allocated isn't a good enough heuristic to launch a GC collection.
I'm not an expert on GC at all, but intuitively it sure seems like allocation size might be a useful piece of information to feed into a heuristic. Our current heuristic is just, run a small collection after every 700 allocations, run a larger collection after 10 smaller collections.
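The thresholds described here are visible and tunable at runtime:

```python
import gc

# CPython's generational thresholds: a generation-0 collection runs
# after (roughly) 700 net allocations, and each older generation is
# collected after 10 collections of the next younger one.
print(gc.get_threshold())  # default: (700, 10, 10)

# They can be adjusted, e.g. to make gen-0 collections less frequent:
gc.set_threshold(7000, 10, 10)
gc.set_threshold(700, 10, 10)  # restore the defaults
```

Note that the counter tracks object *counts*, not bytes, which is precisely the limitation under discussion.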
What if that memory is gonna stay allocated for a long time? Then you're frequently launching GC runs for no tangible result except more CPU consumption and frequent pauses.
Every heuristic has problematic cases, that's why we call it a heuristic :-). But somehow every other GC language manages to do well-enough without refcounting... I think they mostly have more sophisticated heuristics than CPython, though. Off the top of my head, I know PyPy's heuristic involves the ratio of the size of nursery objects versus the size of the heap, and JVMs do much cleverer things like auto-tuning nursery size to make empirical pause times match some target.
Perhaps we could special-case tracebacks somehow, flag when a traceback remains alive after the implicit "del" clause at the end of an "except" block, then maintain some kind of linked list of the flagged tracebacks and launch specialized GC runs to find cycles across that collection. That sounds quite involved, though.
We already keep a list of recently allocated objects and have a specialized GC that runs across just that collection. That's what generational GC is :-). -n -- Nathaniel J. Smith -- https://vorpus.org
On Mon, 18 Sep 2017 13:25:47 -0700 Nathaniel Smith <njs@pobox.com> wrote:
If it is then it might make sense to look at the cycle collection heuristics; IIRC they're based on a fairly naive count of how many allocations have been made, without regard to their size.
Yes... But just because a lot of memory has been allocated isn't a good enough heuristic to launch a GC collection.
I'm not an expert on GC at all, but intuitively it sure seems like allocation size might be a useful piece of information to feed into a heuristic. Our current heuristic is just, run a small collection after every 700 allocations, run a larger collection after 10 smaller collections.
Ok, so there are two useful things a heuristic may try to guess: 1. how much CPU time a (full) collection will cost 2. how much memory it might help release While #1 is quite realistically predictable (because a full collection is basically O(number of allocated objects)), #2 is entirely unpredictable and you can really only make out an upper bound (trivially, the GC cannot release any more memory than has been allocated :-)). Our current heuristic is geared at estimating #1, so that the proportional time taken by GC runs in any long-running program is roughly bounded.
Every heuristic has problematic cases, that's why we call it a heuristic :-). But somehow every other GC language manages to do well-enough without refcounting...
I don't know, define "well enough"? :-) For example, Java is well-known for requiring manual tuning of GC parameters for heavy workloads (or at least it was). I would be curious how well Java deals when someone does numerical crunching on large chunks of data, then throws that data away, for example. Apparently R also uses a garbage collector, and ironically you can find places on the Internet where people recommend you call gc() to avoid thrashing memory. https://stackoverflow.com/questions/8813753/what-is-the-difference-between-g... On CPython at least, once you lose the last reference to your multi-gigabyte Numpy array, memory is returned immediately...
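The "returned immediately" behaviour is easy to observe with a weak reference; a small sketch using a plain class as a stand-in for the large array:

```python
import weakref

class BigArray:
    # Stand-in for a large Numpy array; any plain Python object behaves
    # the same way for this demonstration.
    pass

obj = BigArray()
ref = weakref.ref(obj)
del obj        # the last strong reference is gone...
print(ref())   # → None: freed immediately by refcounting, no GC run needed
```

No gc.collect() call is involved: the deallocation happens the instant the refcount hits zero.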
I think they mostly have more sophisticated heuristics than CPython, though.
I'm sure some of them do. They also tend to have many more man-hours available than we do to optimize the hell out of their GC.
Perhaps we could special-case tracebacks somehow, flag when a traceback remains alive after the implicit "del" clause at the end of an "except" block, then maintain some kind of linked list of the flagged tracebacks and launch specialized GC runs to find cycles across that collection. That sounds quite involved, though.
We already keep a list of recently allocated objects and have a specialized GC that runs across just that collection. That's what generational GC is :-).
Well, you can cross your fingers that the traceback is still in the youngest generations when it finally becomes collectable. But if the traceback stays alive for too long, it ends up in the oldest generation and a long time can pass before it gets collected. Regards Antoine.
Hi, just curious on this,

On 09/18/2017 10:54 PM, Antoine Pitrou wrote:

Ok, so there are two useful things a heuristic may try to guess: 1. how much CPU time a (full) collection will cost 2. how much memory it might help release. While #1 is quite realistically predictable (because a full collection is basically O(number of allocated objects)), #2 is entirely unpredictable and you can really only make out an upper bound (trivially, the GC cannot release any more memory than has been allocated :-)).

Is the allocation size not useful enough for #2? (The upper bound seems logical as long as the allocation is done by the same "entity", or the size is communicated back to it.)

Thanks in advance!
--francis
Hum, my email is long. Maybe a short example is simpler to understand:

---
import gc

class VerboseDestructor:
    # Imagine an object using many limited resources like memory,
    # sockets and/or files
    def __del__(self):
        print("VerboseDestructor")

def create_ref_cycle():
    obj = VerboseDestructor()
    # Create a reference cycle using an exception
    # to keep the frame alive
    try:
        raise ValueError
    except ValueError as exc:
        err = exc

print("create_ref_cycle")
create_ref_cycle()
print("gc.collect")
gc.collect()
print("exit")
---

Output:

---
create_ref_cycle
gc.collect
VerboseDestructor
exit
---

The create_ref_cycle() function creates a reference cycle causing the "obj" variable to be kept alive after create_ref_cycle() exits.

Let's say that the "leaked" object uses a lot of memory. The bug can now be called a memory leak. Let's say that the leaked object uses a file. Now the risk is that the file cannot be removed anymore on Windows (you cannot remove an open file), or that data is not properly flushed to disk, or that the same file is re-opened by a different function in a different thread and so you might corrupt data, etc. It's not hard to imagine the various issues caused by an object kept alive longer than expected.

Well, you *can* modify your code to add a VerboseDestructor.close() method and explicitly release resources as soon as possible in create_ref_cycle(), maybe by adding support for the context manager protocol to VerboseDestructor and using "with obj:" in create_ref_cycle(). But this assumes that you are already aware of the reference cycle. Moreover, Python 3 haters can complain that the code worked perfectly fine in Python 2...

Victor
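The same effect can be made measurable with weak references, contrasting the leaky pattern with the manual "del err" fix (a sketch; Resource and the two function names are made up):

```python
import gc
import weakref

class Resource:
    pass  # stand-in for an object holding memory, sockets or files

def leaky():
    obj = Resource()
    try:
        raise ValueError
    except ValueError as exc:
        err = exc  # closes the frame -> err -> __traceback__ -> frame cycle
    return weakref.ref(obj)

def fixed():
    obj = Resource()
    try:
        raise ValueError
    except ValueError as exc:
        err = exc
        del err    # break the cycle by hand before leaving the handler
    return weakref.ref(obj)

print(leaky()() is None)  # → False: obj survives, waiting for the GC
print(fixed()() is None)  # → True: obj is freed as soon as the frame exits
gc.collect()              # only a cycle collection reclaims the leak
```

This only demonstrates CPython refcounting behaviour; on other implementations both objects would simply wait for the collector.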
On 18 September 2017 at 09:31, Victor Stinner <victor.stinner@gmail.com> wrote:
Last years, I fixed many reference cycles in various parts of the Python 3 standard library. Sometimes, it takes years to become aware of the reference cycle and finally fix it.
For example, recently, I worked on fixing all "dangling threads" leaked by tests of the Python test suite, and I found and fixed many reference cycles which probably existed since Python 3 was created (forked from Python 2)
My instinct is to suggest making the "dangling threads" test tolerate Thread objects that are no longer running. Last time I looked, it tested a list of weak references to Thread objects. But maybe it should check Thread.is_alive() or use threading.enumerate().
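That check could look something like this (a sketch; the helper name is made up):

```python
import threading

def running_threads(baseline):
    # Report only threads that are actually still running, rather than
    # any Thread object that merely remains allocated (for example, one
    # kept alive by a reference cycle).
    return [t for t in threading.enumerate()
            if t not in baseline and t.is_alive()]

baseline = set(threading.enumerate())
worker = threading.Thread(target=lambda: None)
worker.start()
worker.join()
# The Thread object still exists, but it is no longer "dangling":
print(running_threads(baseline))  # → []
```

threading.enumerate() only returns alive threads, so a finished-but-still-referenced Thread object would no longer trigger a warning.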
. . .
Ideally, CPython 3.x should never create reference cycles. Removing Exception.__traceback__ is the obvious "fix" for the issue. But I expect that slowly, a lot of code started to rely on the attribute, maybe even for good reasons :-)
A more practical solution would be to log a warning. Maybe the garbage collector can emit a warning if it detects an exception part of a reference cycle? Or maybe detect frames?
The “gc” module can do some of this already. In the past (when __del__ used to prevent garbage collection) I used DEBUG_SAVEALL and graphed the result to figure out where code was creating reference cycles. Or maybe you could turn on DEBUG_COLLECTABLE and filter the objects printed out, or play with the new “gc.callbacks” API.
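A minimal sketch of the DEBUG_SAVEALL technique: instead of freeing collectable cycles, the collector appends every unreachable object to gc.garbage, where it can be inspected (or graphed):

```python
import gc

class Node:
    pass

gc.set_debug(gc.DEBUG_SAVEALL)

a, b = Node(), Node()
a.other, b.other = b, a        # a deliberate reference cycle
del a, b

gc.collect()
saved = list(gc.garbage)       # the cycle's objects, kept for inspection

gc.set_debug(0)                # restore normal collection
gc.garbage.clear()

print(any(type(o) is Node for o in saved))  # → True
```

From `saved` one can walk gc.get_referrers()/gc.get_referents() to reconstruct who is keeping whom alive.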
If the GC cannot do it, maybe we might use a debug thread (enabled manually) which checks manually if an exception is part of a reference cycle using gc.get_objects(): check if an exception remains alive longer than X seconds? I had the same idea for asyncio, to detect reference cycles or if a task is never "awaited", but I never implemented the idea.
Thanks for working on this and writing up the details Victor. For those confused about why this matters, routinely having every object in your application participating in one (or more) giant reference cycles makes reasoning about and fixing resource issues very difficult. For instance, a while back I made some changes in Bazaar so that each TestCase instance was dereferenced after being run: <https://bugs.launchpad.net/bzr/+bug/613247> Which mattered when running ~25000 tests, to keep peak memory usage sane. Also there were various other object lifetime issues, and tracking down which specific test failed to join a thread, or close a file on Windows was much simpler after making sure cleanup actually happened at the time a test was deleted. Keeping more stuff alive for longer periods also makes peak memory requirements higher, and the gc has to do more work each time. On 18/09/2017, Victor Stinner <victor.stinner@gmail.com> wrote:
Ideally, CPython 3.x should never create reference cycles. Removing Exception.__traceback__ is the obvious "fix" for the issue. But I expect that slowly, a lot of code started to rely on the attribute, maybe even for good reasons :-)
A more practical solution would be to log a warning. Maybe the garbage collector can emit a warning if it detects an exception part of a reference cycle? Or maybe detect frames?
Logging a warning is unlikely to be practical. I had an optional test flag that complained about reference cycles, and it produced a lot of output. Martin
participants (9)
- Antoine Pitrou
- Antoine Pitrou
- francismb
- Guido van Rossum
- Martin (gzlist)
- Martin Panter
- Nathaniel Smith
- Nick Coghlan
- Victor Stinner