[Python-Dev] Evil reference cycles caused by Exception.__traceback__

Antoine Pitrou solipsis at pitrou.net
Mon Sep 18 16:54:03 EDT 2017


On Mon, 18 Sep 2017 13:25:47 -0700
Nathaniel Smith <njs at pobox.com> wrote:
> >> If it is then it might make sense to look at the cycle collection
> >> heuristics; IIRC they're based on a fairly naive count of how many
> >> allocations have been made, without regard to their size.  
> >
> > Yes... But just because a lot of memory has been allocated isn't a good
> > enough heuristic to launch a GC collection.  
> 
> I'm not an expert on GC at all, but intuitively it sure seems like
> allocation size might be a useful piece of information to feed into a
> heuristic. Our current heuristic is just, run a small collection after
> every 700 allocations, run a larger collection after 10 smaller
> collections.

Ok, so there are two useful things a heuristic may try to guess:
1. how much CPU time a (full) collection will cost
2. how much memory it might help release

While #1 is quite realistically predictable (a full collection is
basically O(number of allocated objects)), #2 is entirely
unpredictable: you can really only establish an upper bound
(trivially, the GC cannot release more memory than has been
allocated :-)).

Our current heuristic is geared at estimating #1, so that the
proportional time taken by GC runs in any long-running program is
roughly bounded.
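For reference, the knobs behind that heuristic are exposed through the
gc module; a quick way to inspect them (default values shown are
CPython's, and note that object *size* plays no role anywhere):

```python
import gc

# The three thresholds drive CPython's generational collector:
# a generation-0 collection runs once the number of tracked
# allocations minus deallocations exceeds the first threshold;
# generation 1 is collected after 10 gen-0 collections, and
# generation 2 after 10 gen-1 collections.
thresholds = gc.get_threshold()   # (700, 10, 10) by default
counts = gc.get_count()           # pending allocation counts per generation
```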

> Every heuristic has problematic cases, that's why we call it a
> heuristic :-). But somehow every other GC language manages to do
> well-enough without refcounting...

I don't know, define "well enough"? :-) For example, Java is well-known
for requiring manual tuning of GC parameters for heavy workloads (or
at least it was).

I would be curious how well Java copes when someone does numerical
crunching on large chunks of data and then throws that data away, for
example.

Apparently R also uses a garbage collector, and ironically you can find
places on the Internet where people recommend you call gc() to avoid
thrashing memory.
https://stackoverflow.com/questions/8813753/what-is-the-difference-between-gc-and-rm

On CPython at least, once you lose the last reference to your
multi-gigabyte Numpy array, memory is returned immediately...
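A minimal sketch of that behaviour, using a plain bytearray-backed
class as a stand-in for the NumPy array so the example stays
dependency-free:

```python
import weakref

class Buffer:
    """Stand-in for a multi-gigabyte NumPy array."""
    def __init__(self, size):
        self.data = bytearray(size)

buf = Buffer(10 ** 6)
ref = weakref.ref(buf)

del buf                # last reference dropped...
assert ref() is None   # ...and the object is freed immediately,
                       # no GC pass needed: pure reference counting
```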

> I think they mostly have more
> sophisticated heuristics than CPython, though.

I'm sure some of them do.  They also tend to have many more man-hours
available than we do to optimize the hell out of their GC.

> > Perhaps we could special-case tracebacks somehow, flag when a traceback
> > remains alive after the implicit "del" clause at the end of an "except"
> > block, then maintain some kind of linked list of the flagged tracebacks
> > and launch specialized GC runs to find cycles across that collection.
> > That sounds quite involved, though.  
> 
> We already keep a list of recently allocated objects and have a
> specialized GC that runs across just that collection. That's what
> generational GC is :-).

Well, you can cross your fingers that the traceback is still in the
youngest generations when it finally becomes collectable.  But if the
traceback stays alive for too long, it ends up in the oldest generation
and a long time can pass before it gets collected.
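For illustration, here is a sketch of the kind of cycle being
discussed (Probe and leak are made-up names): a local variable keeps
the caught exception alive, the exception's __traceback__ references
the frame, and the frame references that local, so reference counting
alone can never reclaim it:

```python
import gc
import weakref

class Probe:
    pass

def leak():
    probe = Probe()
    try:
        raise ValueError("boom")
    except ValueError as e:
        err = e   # stash the exception in an ordinary local;
                  # the implicit "del e" at the end of the except
                  # block does not touch "err"
    # Cycle: err -> err.__traceback__ -> frame of leak() -> err
    return weakref.ref(probe)

gc.disable()            # make the demonstration deterministic
try:
    ref = leak()
    assert ref() is not None   # frame (and probe) kept alive by the cycle
    gc.collect()               # only the cyclic GC can break it
    assert ref() is None
finally:
    gc.enable()
```

If the exception lingers long enough to be promoted to the oldest
generation, that gc.collect() may not happen for a long time, which is
exactly the scenario above.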

Regards

Antoine.



