[Python-Dev] Lazily GC tracking tuples

Tim Peters tim_one@email.msn.com
Wed, 8 May 2002 22:39:49 -0400

[Daniel Dunbar]
> ...
> The other thing to note is that untracking such tuples is a
> very small win regardless - if they were not untracked the
> tuple visit routine will quickly scan down the tuple (which
> I guess is likely to be small to start with), and the GC
> visitor will quickly reject adding any of the contained
> objects because they are not PyObject_IS_GC().

I don't know that it's small.  It's a function call per tuple scan, and apps
slinging millions of tuples aren't rare:  just touching the memory to scan
all of them is expensive.

> That suggests that the only case it would make a difference
> is a) when there are soooo many untrackable tuples that
> simply traversing them takes a large amount of time (ie.
> the iterzip case), or b) when there is a massive nesting
> of untrackable tuples - an large AST or binary-tree might
> benefit from this, but there are perhaps better solutions
> to that problem (see below)


> ...
> I still don't understand how you 'check the tuple' and then
> untrack it - even if it has no NULL slots it still isn't safe
> to remove.

It's very likely safe, and if we do this I can easily guarantee it's 100%
safe for the core distribution.  It's not possible to create a tuple with
NULL slots from Python code, so it's just a matter of checking the core C
code that creates tuples.  Most of it is utterly and obviously safe.  A
little bit isn't.  For example, there's a very small hole in
PySequence_Tuple() then, when the allocated tuple turns out not to be big
enough:  then invoking the input sequence iterator can trigger gc, and that
might erroneously untrack the tuple being constructed.  But that hole is
easily plugged, and I expect that all such holes are just as easily plugged.

Extension modules aren't supposed to call _PyTuple_Resize() at all (it's
part of the private API), and it's very unlikely that anything not calling
that routine is vulnerable.  It's only called from 5 places in the core, and
2 of them are in PySequence_Tuple.  Still, it would be a new restriction,
and may fool something *somewhere*.  If it does, and they're tuples that
turn out to be in a cycle (itself rare, especially for tuples that grow),
they may get a leak.  BFD <wink>.

> I don't think I agree that special casing tuples in the
> collector is a good idea - this whole issue just came
> up because of the iterzip test, where it was noticed
> that the collector really didn't have to check all
> the objects, which is just a special case solution to
> the problem that many of the objects in the gen*
> lists are not going to be participating in a cycle.

As above, apps slinging millions of tuples aren't rare, and it's not doing
anyone any good in cases where they survive an unbounded number of
collections.  They may be important not because it's pretty clear how to
exempt them, but because they may be important <0.9 wink>.

> A solution to that general problem would be much more
> beneficial, although presumably much more work ;)
> If the collector could somehow prioritize objects,
> it seems like various heuristics could be used to
> gauge the likelyhood of an object participating in
> a cycle vs. the 'good' of collecting that cycle.

Neil has since generalized the gc module to make it easy to add more
generations.  That's a general and effective approach to reducing useless
scans.  We're never going to have anything like, say, write barriers in
Python, and OS-specific and platform-specific tricks are right out.  GC in
Python lives within many severe constraints.

> As an aside: are there some nice established tests
> for estimating the benefit of modifications to
> the collector? Something that needs the collector
> (so getting cycles early keeps the memory consumption
> down, improving performance) and also has dynamic
> object allocation properties that mimic some ideal
> of real-applications?

Not that I know of.  The cycle gc has performed (IMO), overall, very well
since its inception.  Improvements are driven more by the rare pathological
cases that pop up than by systematic experimentation.  Neil seems to use one
of his large apps to "try things out on", and that's as close to an ideal
real-life app as I expect anyone is going to hand Neil.  Glitches tend to
show up as timing anomalies (iterzip was typical in that respect).