... The other thing to note is that untracking such tuples is a very small win regardless - if they were not untracked the tuple visit routine will quickly scan down the tuple (which I guess is likely to be small to start with), and the GC visitor will quickly reject adding any of the contained objects because they are not PyObject_IS_GC().
I don't know that it's small. It's a function call per tuple scan, and apps slinging millions of tuples aren't rare: just touching the memory to scan all of them is expensive.
That suggests that the only case it would make a difference is a) when there are soooo many untrackable tuples that simply traversing them takes a large amount of time (ie. the iterzip case), or b) when there is a massive nesting of untrackable tuples - an large AST or binary-tree might benefit from this, but there are perhaps better solutions to that problem (see below)
... I still don't understand how you 'check the tuple' and then untrack it - even if it has no NULL slots it still isn't safe to remove.
It's very likely safe, and if we do this I can easily guarantee it's 100% safe for the core distribution. It's not possible to create a tuple with NULL slots from Python code, so it's just a matter of checking the core C code that creates tuples. Most of it is utterly and obviously safe. A little bit isn't. For example, there's a very small hole in PySequence_Tuple() then, when the allocated tuple turns out not to be big enough: then invoking the input sequence iterator can trigger gc, and that might erroneously untrack the tuple being constructed. But that hole is easily plugged, and I expect that all such holes are just as easily plugged.
Extension modules aren't supposed to call _PyTuple_Resize() at all (it's part of the private API), and it's very unlikely that anything not calling that routine is vulnerable. It's only called from 5 places in the core, and 2 of them are in PySequence_Tuple. Still, it would be a new restriction, and may fool something *somewhere*. If it does, and they're tuples that turn out to be in a cycle (itself rare, especially for tuples that grow), they may get a leak. BFD <wink>.
I don't think I agree that special casing tuples in the collector is a good idea - this whole issue just came up because of the iterzip test, where it was noticed that the collector really didn't have to check all the objects, which is just a special case solution to the problem that many of the objects in the gen* lists are not going to be participating in a cycle.
As above, apps slinging millions of tuples aren't rare, and it's not doing anyone any good in cases where they survive an unbounded number of collections. They may be important not because it's pretty clear how to exempt them, but because they may be important <0.9 wink>.
A solution to that general problem would be much more beneficial, although presumably much more work ;)
If the collector could somehow prioritize objects, it seems like various heuristics could be used to gauge the likelyhood of an object participating in a cycle vs. the 'good' of collecting that cycle.
Neil has since generalized the gc module to make it easy to add more generations. That's a general and effective approach to reducing useless scans. We're never going to have anything like, say, write barriers in Python, and OS-specific and platform-specific tricks are right out. GC in Python lives within many severe constraints.
As an aside: are there some nice established tests for estimating the benefit of modifications to the collector? Something that needs the collector (so getting cycles early keeps the memory consumption down, improving performance) and also has dynamic object allocation properties that mimic some ideal of real-applications?
Not that I know of. The cycle gc has performed (IMO), overall, very well since its inception. Improvements are driven more by the rare pathological cases that pop up than by systematic experimentation. Neil seems to use one of his large apps to "try things out on", and that's as close to an ideal real-life app as I expect anyone is going to hand Neil. Glitches tend to show up as timing anomalies (iterzip was typical in that respect).