[Python-ideas] Deterministic iterator cleanup

Nick Coghlan ncoghlan at gmail.com
Wed Oct 26 12:54:27 EDT 2016


On 26 October 2016 at 08:25, Nathaniel Smith <njs at pobox.com> wrote:
> On Sat, Oct 22, 2016 at 9:02 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> At this point your code is starting to look a whole lot like the code
>> in contextlib.ExitStack.__exit__ :)
>
> One of the versions I tried but didn't include in my email used
> ExitStack :-). It turns out not to work here: the problem is that we
> effectively need to enter *all* the contexts before unwinding, even if
> trying to enter one of them fails. ExitStack is nested like (try (try
> (try ... finally) finally) finally), and we need (try finally (try
> finally (try finally ...)))

Regardless of any other outcome from this thread, it may be useful to
have a "contextlib.ResourceSet" as an abstraction for collective
management of resources. As you say, the main difference is that the
invocation of the cleanup functions wouldn't be nested at all and
could happen in arbitrary order (if that's not sufficient for a
particular use case, then you'd define an ExitStack for the items
where the order of cleanup matters, and then register *that* with the
ResourceSet).
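
As a rough illustration of the difference, a minimal sketch of such a
ResourceSet might look like this (the name and API are hypothetical at
this point):

    class ResourceSet:
        """Collective, order-independent resource management (sketch)."""
        def __init__(self):
            self._resources = []

        def enter_context(self, cm):
            # Enter the context immediately, remember it for later cleanup
            result = cm.__enter__()
            self._resources.append(cm)
            return result

        def close(self):
            # Unlike ExitStack, the cleanup calls aren't nested: each one
            # runs independently, in no particular order, and the first
            # exception (if any) is re-raised once everything has run
            resources, self._resources = self._resources, []
            first_exc = None
            for cm in resources:
                try:
                    cm.__exit__(None, None, None)
                except BaseException as exc:
                    if first_exc is None:
                        first_exc = exc
            if first_exc is not None:
                raise first_exc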

>> A potentially more fruitful direction of research to pursue for 3.7
>> would be the notion of "frame local resources", where each Python
>> level execution frame implicitly provided a lazily instantiated
>> ExitStack instance (or an equivalent) for resource management.
>> Assuming that it offered an "enter_frame_context" function that mapped
>> to "contextlib.ExitStack.enter_context", such a system would let us do
>> things like:
>
> So basically a 'with expression', that gives up the block syntax --
> taking its scope from the current function instead -- in return for
> being usable in expression context? That's really interesting, and I
> see the intuition that it might be less disruptive if our implicit
> iterclose calls are scoped to the function rather than the 'for' loop.
>
> But having thought about it and investigated some... I don't think
> function-scoping addresses my problem, and I don't see evidence that
> it's meaningfully less disruptive to existing code.
>
> First, "my problem":
>
> Obviously, Python's a language that should be usable for folks doing
> one-off scripts, and for paranoid folks trying to write robust complex
> systems, and for everyone in between -- these are all really important
> constituencies. And unfortunately, there is a trade-off here, where
> the changes we're discussing affect these constituencies differently.
> But it's not just a matter of shifting around a fixed amount of pain;
> the *quality* of the pain really changes under the different
> proposals.
>
> In the status quo:
> - for one-off scripts: you can just let the GC worry about generator
> and file handle cleanup, re-use iterators, whatever, it's cool
> - for robust systems: because it's the *caller's* responsibility to
> ensure that iterators are cleaned up, you... kinda can't really use
> generators without -- pick one -- (a) draconian style guides (like
> forbidding 'with' inside generators or forbidding bare 'for' loops
> entirely), (b) lots of auditing (every time you write a 'for' loop, go
> read the source to the generator you're iterating over -- no
> modularity for you and let's hope the answer doesn't change!), or (c)
> introducing really subtle bugs.

(Note: I've changed my preferred API name from "function_resource" +
"frame_resource" to the general purpose "scoped_resource". While it's
somewhat jargony, which I consider unfortunate, the goal is to make
the runtime scope of the resource match the lexical scope of the
reference as closely as is feasible, and if folks are going to
understand how Python manages references and resources, they're going
to need to learn the basics of Python's scope management at some
point.)

Given your points below, the defensive coding recommendation here would be to

- always wrap your iterators in scoped_resource() to tell Python to
clean them up when the function is done
- explicitly call close_resources() after the affected for loops to
clean the resources up early

You'd still be vulnerable to resource leaks in libraries you didn't
write, but would have decent control over your own code without having
to make overly draconian changes to your style guide - you'd only need
one new rule, which is "Whenever you're iterating over something, pass
it through scoped_resource first".
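
Sketching that recommendation out (with the caveat that
scoped_resource() and close_resources() are proposed names, not
existing APIs, and read_records() and handle() are just placeholder
helpers):

    def process_file(path):
        # Register the iterator for cleanup when this frame exits...
        for record in scoped_resource(read_records(path)):
            handle(record)
        # ...and explicitly release everything registered so far,
        # rather than waiting for the end of the function
        close_resources()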

To simplify this from a forwards compatibility perspective (i.e. so it
can implicitly adjust when an existing type gains a cleanup method),
we'd make scoped_resource() quite permissive, accepting arbitrary
objects with the following behaviours:

- if it's a context manager, enter it, and register the exit callback
- if it's not a context manager, but has a close() method, register
the close method
- otherwise, pass it straight through without taking any other action

This would allow folks to always declare something as a scoped
resource without impeding their ability to handle objects that aren't
resources at all.
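
A sketch of that permissive behaviour (assuming a purely hypothetical
_current_frame_stack() helper that returns the calling frame's lazily
instantiated ExitStack or equivalent):

    def scoped_resource(obj):
        stack = _current_frame_stack()  # hypothetical frame-local stack
        if hasattr(obj, "__enter__") and hasattr(obj, "__exit__"):
            # Context manager: enter it, register the exit callback
            return stack.enter_context(obj)
        close = getattr(obj, "close", None)
        if close is not None:
            # Not a context manager, but has close(): register that
            stack.callback(close)
        # Otherwise pass it straight through unchanged
        return obj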

The long term question would then become whether it made sense to have
certain language constructs implicitly mark their targets as scoped
resources *by default*, and clean them up selectively after the loop
rather than using the blunt instrument of cleaning up all previously
registered resources. If we did start seriously considering such a
change, then there would be potential utility in an "unmanaged_iter()"
wrapper which forwarded *only* the iterator protocol methods, thus
hiding any __exit__() or close() methods from scoped_resource().
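
Such a wrapper could be as simple as the following sketch (note that a
generator-based implementation wouldn't work here, since generators
themselves have a close() method):

    class unmanaged_iter:
        """Forward only the iterator protocol, hiding __exit__/close."""
        def __init__(self, iterable):
            self._it = iter(iterable)
        def __iter__(self):
            return self
        def __next__(self):
            return next(self._it)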

However, the time to consider such a change in default behaviour would
be *after* we had some experience with explicit declarations and
management of scoped resources - plenty of folks are writing plenty of
software today in garbage collected languages (including Python), and
coping with external resource management problems as they arise, so we
don't need to do anything hasty here. I personally think an explicit
solution is likely to be sufficient (given the caveat of adding a
"gc.collect()" counterpart), with an API like `scoped_resource` being
adopted over time in libraries, frameworks and applications based both
on actual defects found in running production systems and on the
defensive coding style. Your example below makes me even more firmly
convinced that that's the better way to go.

> In my proposal (for-scoped-iterclose):
> - for robust systems: life is great -- you're still stopping to think
> a little about cleanup every time you use an iterator (because that's
> what it means to write robust code!), but since the iterators now know
> when they need cleanup and regular 'for' loops know how to invoke it,
> then 99% of the time (i.e., whenever you don't intend to re-use an
> iterator) you can be confident that just writing 'for' will do exactly
> the right thing, and the other 1% of the time (when you do want to
> re-use an iterator), you already *know* you're doing something clever.
> So the cognitive overhead on each for-loop is really low.

In mine, if your style guide says "Use scoped_resource() and an
explicit close_resources() call when iterating", you'd add it (or your
automated linter would complain that it was missing). So the cognitive
overhead is higher, but it would remain where it belongs (i.e. on
professional developers being paid to write robust code).

> - for one-off scripts: ~99% of the time (actual measurement, see
> below) everything just works, except maybe a little bit better. 1% of
> the time, you deploy the clever trick of re-using an iterator with
> multiple for loops, and it breaks, so this is some pain. Here's what
> you see:
>
>     gen_obj = ...
>     for first_line in gen_obj:
>         break
>     for lines in gen_obj:
>         ...
>
>     Traceback (most recent call last):
>       File "/tmp/foo.py", line 5, in <module>
>         for lines in gen_obj:
>     AlreadyClosedIteratorError: this iterator was already closed,
> possibly by a previous 'for' loop. (Maybe you want
> itertools.preserve?)
>
> (We could even have a PYTHONDEBUG flag that when enabled makes that
> error message include the file:line of the previous 'for' loop that
> called __iterclose__.)
>
> So this is pain! But the pain is (a) rare, not pervasive, (b)
> immediately obvious (an exception, the code doesn't work at all), not
> subtle and delayed, (c) easily googleable, (d) easy to fix and the fix
> is reliable. It's a totally different type of pain than the pain that
> we currently impose on folks who want to write robust code.

And it's completely unnecessary - with explicit scoped_resource() calls
absolutely nothing changes for the scripting use case, and even with
implicit ones, re-use *within the same scope* would still be fine
(you'd only get into trouble if the resource escaped the scope where
it was first marked as a scoped resource).
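
For example, even a defensively coded variant of your gen_obj snippet
above would keep working, since both loops run in the frame that
registered the resource (make_gen() is a placeholder, and the resource
API is still the proposed one):

    gen_obj = scoped_resource(make_gen())
    for first_line in gen_obj:
        break
    for lines in gen_obj:  # still open: nothing has closed it yet
        ...
    # gen_obj is only closed by close_resources() or at frame exit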

> Now compare to the new proposal (function-scoped-iterclose):
>
> - For those who want robust cleanup: Usually, I only need an iterator
> for as long as I'm iterating over it; that may or may not correspond
> to the end of the function (often won't). When these don't coincide,
> it can cause problems. E.g., consider the original example from my
> proposal:
>
>   def read_newline_separated_json(path):
>       with open(path) as f:
>           for line in f:
>               yield json.loads(line)
>
> but now suppose that I'm a Data Scientist (tm) so instead of having 1
> file full of newline-separated JSON, I have a 100 gigabytes worth of
> the stuff stored in lots of files in a directory tree. Well, that's no
> problem, I'll just wrap that generator:
>
>   def read_newline_separated_json_tree(tree):
>       for root, _, paths in os.walk(tree):
>           for path in paths:
>               for document in read_newline_separated_json(join(root, path)):
>                   yield document

If you're being paid to write robust code and are using Python 3.7+,
then you'd add scoped_resource() around the
read_newline_separated_json() call and then add a close_resources()
call after that loop. That'd be part of your job, and just another
point in the long list of reasons why developing software as a
profession isn't the same thing as doing it as a hobby. We'd design
scoped_resource() in such a way that it could be harmlessly wrapped
around "paths" as well, even though we know that's technically not
necessary (since it's just a list of strings).
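
Concretely, the robust version of the tree walker might look like this
(again using the proposed scoped_resource()/close_resources() API):

    def read_newline_separated_json_tree(tree):
        for root, _, paths in os.walk(tree):
            for path in paths:
                docs = scoped_resource(
                    read_newline_separated_json(join(root, path)))
                for document in docs:
                    yield document
                # Close each file's generator promptly rather than
                # letting thousands of them accumulate on the frame
                close_resources()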

As noted above, I'm also open to the notion of some day making all for
loops implicitly declare the iterators they operate on as scoped
resources, but I don't think we should do that without gaining some
experience with the explicit form first (where we can be confident
that any unexpected negative consequences will be encountered by folks
already well equipped to deal with them).

> And then I'll run it on PyPy, because that's what you do when you have
> 100 GB of string processing, and... it'll crash, because the call to
> read_newline_separated_json_tree ends up doing thousands of calls to
> read_newline_separated_json, but never cleans any of them up until
> the function exits, so eventually we run out of file descriptors.

And we'll go "Oops", and refactor our code to better control the scope
of our resources, either by adding a with statement around the
innermost loop or using the new scoped resources API (if such a thing
gets added). The *whole point* of iterative development is to solve
the problems you know you have, not the problems you or someone else
might potentially have at some point in the indeterminate future.
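
The with statement form of that refactoring is already expressible
today, since generators provide a close() method that
contextlib.closing can drive:

    import os
    from os.path import join
    from contextlib import closing

    def read_newline_separated_json_tree(tree):
        for root, _, paths in os.walk(tree):
            for path in paths:
                # Ensure each generator (and hence its file) is closed
                # before moving on to the next path
                with closing(read_newline_separated_json(
                        join(root, path))) as docs:
                    for document in docs:
                        yield document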

> A similar situation arises in the main loop of something like an HTTP server:
>
>   while True:
>       request = read_request(sock)
>       for response_chunk in application_handler(request):
>           send_response_chunk(sock)
>
> Here we'll accumulate arbitrary numbers of un-closed
> application_handler generators attached to the stack frame, which is
> no good at all. And this has the interesting failure mode that you'll
> probably miss it in testing, because most clients will only re-use a
> connection a small number of times.

And the fixed code (given the revised API proposal above) looks like this:

   while True:
       request = read_request(sock)
       for response_chunk in scoped_resource(application_handler(request)):
           send_response_chunk(sock)
       close_resources()

This pattern has the advantage of also working if the resources you
want to manage aren't precisely what you're iterating over, or if
you're iterating over them in a while loop rather than a for loop.
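
For example, the managed resource in a connection handling loop may be
the connection itself rather than any iterator (accept_connection()
and handle_chunk() are placeholders, and the resource API is still the
proposed one):

    while True:
        conn = scoped_resource(accept_connection(sock))
        while True:
            chunk = conn.read(8192)
            if not chunk:
                break
            handle_chunk(chunk)
        # Release the connection before accepting the next one
        close_resources()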

> So what this means is that every time I write a for loop, I can't just
> do a quick "am I going to break out of the for-loop and then re-use
> this iterator?" check -- I have to stop and think about whether this
> for-loop is nested inside some other loop, etc.

Or you unconditionally add the scoped_resource/close_resources calls
to force non-reference-counted implementations to behave a bit more
like CPython and don't worry about it further.

> - For those who just want to write a quick script and not think about
> it: here's a script that does repeated partial for-loops over a
> generator object:
>
>     https://github.com/python/cpython/blob/553a84c4c9d6476518e2319acda6ba29b8588cb4/Tools/scripts/gprof2html.py#L40-L79
>
> (and note that the generator object even has an ineffective 'with
> open(...)' block inside it!)
>
> With the function-scoped-iterclose, this script would continue to work
> as it does now. Excellent.

As it would with the explicit scoped_resource/close_resources API.

> But, suppose that I decide that that main() function is really
> complicated and that it would be better to refactor some of those
> loops out into helper functions. (Probably actually true in this
> example.) So I do that and... suddenly the code breaks. And in a
> rather confusing way, because it has to do with this complicated
> long-distance interaction between two different 'for' loops *and*
> where they're placed with respect to the original function versus the
> helper function.

I do agree that the fact it would break common code refactoring
patterns is a good counter-argument against ever calling
scoped_resource() implicitly.

> Anyway, in summary: function-scoped-iterclose doesn't seem to
> accomplish my goal of getting rid of the *type* of pain involved when
> you have to run a background thread in your brain that's doing
> constant paranoid checking every time you write a for loop. Instead it
> arguably takes that type of pain and spreads it around both the
> experts and the novices :-/.

Does the addition of the explicit close_resources() API mitigate your concern?

> Now, let's look at some evidence about how disruptive the two
> proposals are for real code:
>
> As mentioned else-thread, I wrote a stupid little CPython hack [1] to
> report when the same iterator object gets passed to multiple 'for'
> loops, and ran the CPython and Django testsuites with it [2]. Looking
> just at generator objects [3], across these two large codebases there
> are exactly 4 places where this happens.

The standard library and a web framework are in no way typical of
Python application and scripting code.

> 3) Django django/utils/regex_helper.py:236
>
> This code is very similar to the previous example in its general
> outline, except that the 'for' loops *have* been factored out into
> utility functions. So in this case for-scoped-iterclose and
> function-scoped-iterclose are equally disruptive.

But explicitly scoped resource management leaves it alone.

> 4) CPython's Lib/test/test_generators.py:723
>
> I have to admit I cannot figure out what this code is doing, besides
> showing off :-). But the different 'for' loops are in different stack
> frames, so I'm pretty sure that for-scoped-iterclose and
> function-scoped-iterclose would be equally disruptive.

And explicitly scoped resource management again leaves it alone.

> Obviously there's a bias here in that these are still relatively
> "serious" libraries; I don't have a big corpus of one-off scripts that
> are just a big __main__, though gprof2html.py isn't far from that. (If
> anyone knows where to find such a thing let me know...) But still, the
> tally here is that out of 4 examples, we have 1 subtle bug that
> iterclose might have caught, 2 cases where for-scoped-iterclose and
> function-scoped-iterclose are equally disruptive, and only 1 where
> function-scoped-iterclose is less disruptive -- and in that case it's
> arguably just avoiding an obvious error now in favor of a more
> confusing error later.
>
> If this reduced the backwards-incompatible cases by a factor of, like,
> 10x or 100x, then that would be a pretty strong argument in its favor.
> But it seems to be more like... 1.5x.

The explicit-API-only aspect of the proposal eliminates 100% of the
backwards incompatibilities :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

