[Python-ideas] Deterministic iterator cleanup

Nathaniel Smith njs at pobox.com
Tue Oct 25 18:25:18 EDT 2016


On Sat, Oct 22, 2016 at 9:02 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On 20 October 2016 at 07:02, Nathaniel Smith <njs at pobox.com> wrote:
>> The first change is to replace the outer for loop with a while/pop
>> loop, so that if an exception occurs we'll know which iterables remain
>> to be processed:
>>
>> def chain(*iterables):
>>     try:
>>         while iterables:
>>             for element in iterables.pop(0):
>>                 yield element
>>     ...
>>
>> Now, what do we do if an exception does occur? We need to call
>> iterclose on all of the remaining iterables, but the tricky bit is
>> that this might itself raise new exceptions. If this happens, we don't
>> want to abort early; instead, we want to continue until we've closed
>> all the iterables, and then raise a chained exception. Basically what
>> we want is:
>>
>> def chain(*iterables):
>>     try:
>>         while iterables:
>>             for element in iterables.pop(0):
>>                 yield element
>>     finally:
>>         try:
>>             operator.iterclose(iter(iterables[0]))
>>         finally:
>>             try:
>>                 operator.iterclose(iter(iterables[1]))
>>             finally:
>>                 try:
>>                     operator.iterclose(iter(iterables[2]))
>>                 finally:
>>                     ...
>>
>> but of course that's not valid syntax. Fortunately, it's not too hard
>> to rewrite that into real Python -- but it's a little dense:
>>
>> def chain(*iterables):
>>     try:
>>         while iterables:
>>             for element in iterables.pop(0):
>>                 yield element
>>     # This is equivalent to the nested-finally chain above:
>>     except BaseException as last_exc:
>>         for iterable in iterables:
>>             try:
>>                 operator.iterclose(iter(iterable))
>>             except BaseException as new_exc:
>>                 if new_exc.__context__ is None:
>>                     new_exc.__context__ = last_exc
>>                 last_exc = new_exc
>>         raise last_exc
>>
>> It's probably worth wrapping that bottom part into an iterclose_all()
>> helper, since the pattern probably occurs in other cases as well.
>> (Actually, now that I think about it, the map() example in the text
>> should be doing this instead of what it's currently doing... I'll fix
>> that.)
>
> At this point your code is starting to look a whole lot like the code
> in contextlib.ExitStack.__exit__ :)

One of the versions I tried but didn't include in my email used
ExitStack :-). It turns out not to work here: the problem is that we
effectively need to enter *all* the contexts before unwinding, even if
trying to enter one of them fails. ExitStack is nested like (try (try
(try ... finally) finally) finally), and we need (try finally (try
finally (try finally ...))). But this is just a small side point
anyway, since most code is not implementing complicated
meta-iterators; I'll address your real proposal below.
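
For reference, here's roughly what that iterclose_all() helper might
look like -- it's just the exception-chaining loop from the quoted
code, factored out (a sketch; operator.iterclose is part of the
proposal, not current Python):

  def iterclose_all(iterables):
      # Close every iterable, even if some of the closes raise; chain
      # the exceptions together and re-raise the last one at the end,
      # giving the (try finally (try finally ...)) nesting described
      # above.
      last_exc = None
      for iterable in iterables:
          try:
              operator.iterclose(iter(iterable))
          except BaseException as new_exc:
              if new_exc.__context__ is None:
                  new_exc.__context__ = last_exc
              last_exc = new_exc
      if last_exc is not None:
          raise last_exc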

> Accordingly, I'm going to suggest that while I agree the problem you
> describe is one that genuinely emerges in large production
> applications and other complex systems, this particular solution is
> simply far too intrusive to be accepted as a language change for
> Python - you're talking a fundamental change to the meaning of
> iteration for the sake of the relatively small portion of the
> community that either work on such complex services, or insist on
> writing their code as if it might become part of such a service, even
> when it currently isn't. Given that simple applications vastly
> outnumber complex ones, and always will, I think making such a change
> would be a bad trade-off that didn't come close to justifying the
> costs imposed on the rest of the ecosystem to adjust to it.
>
> A potentially more fruitful direction of research to pursue for 3.7
> would be the notion of "frame local resources", where each Python
> level execution frame implicitly provided a lazily instantiated
> ExitStack instance (or an equivalent) for resource management.
> Assuming that it offered an "enter_frame_context" function that mapped
> to "contextlib.ExitStack.enter_context", such a system would let us do
> things like:
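
I assume the usage you have in mind is something like this (my
sketch, not your code -- enter_frame_context is the hypothetical API
here):

  def first_line(path):
      f = enter_frame_context(open(path))   # hypothetical API
      # f is registered on this frame's implicit ExitStack, so it's
      # closed when the frame exits -- no 'with' block needed
      return next(f)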

So basically a 'with expression', that gives up the block syntax --
taking its scope from the current function instead -- in return for
being usable in expression context? That's a really interesting idea,
and I see the intuition that it might be less disruptive if our
implicit iterclose calls are scoped to the function rather than the
'for' loop.

But having thought about it and investigated some... I don't think
function-scoping addresses my problem, and I don't see evidence that
it's meaningfully less disruptive to existing code.

First, "my problem":

Obviously, Python's a language that should be usable for folks doing
one-off scripts, and for paranoid folks trying to write robust complex
systems, and for everyone in between -- these are all really important
constituencies. And unfortunately, there is a trade-off here, where
the changes we're discussing affect these constituencies differently.
But it's not just a matter of shifting around a fixed amount of pain;
the *quality* of the pain really changes under the different
proposals.

In the status quo:
- for one-off scripts: you can just let the GC worry about generator
and file handle cleanup, re-use iterators, whatever, it's cool
- for robust systems: because it's the *caller's* responsibility to
ensure that iterators are cleaned up, you... kinda can't really use
generators without -- pick one -- (a) draconian style guides (like
forbidding 'with' inside generators or forbidding bare 'for' loops
entirely), (b) lots of auditing (every time you write a 'for' loop, go
read the source to the generator you're iterating over -- no
modularity for you and let's hope the answer doesn't change!), or (c)
introducing really subtle bugs. Or all of the above. It's true that a
lot of the time you can ignore this problem and get away with it one
way or another, but if you're trying to write robust code then this
doesn't really help -- it's like saying the footgun only has 1 bullet
in the chamber. Not as reassuring as you'd think. It's like if every
time you called a function, you had to explicitly say whether you
wanted exception handling to be enabled inside that function, and if
you forgot then the interpreter might just skip the 'finally' blocks
while unwinding. There just *isn't* a good solution available.
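
To make the robust-systems problem above concrete, here's the classic
footgun (a sketch):

  def lines(path):
      with open(path) as f:
          for line in f:
              yield line

  gen = lines("big.txt")
  for line in gen:
      if line.startswith("#"):
          break
  # After the 'break', gen is left suspended inside its 'with' block;
  # the file only gets closed when the generator object is finalized
  # -- promptly on CPython thanks to refcounting, but at some
  # arbitrary later time on PyPy or if a reference keeps it alive.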

In my proposal (for-scoped-iterclose):
- for robust systems: life is great -- you're still stopping to think
a little about cleanup every time you use an iterator (because that's
what it means to write robust code!). But since iterators now know
when they need cleanup and regular 'for' loops know how to invoke it,
99% of the time (i.e., whenever you don't intend to re-use an
iterator) you can be confident that just writing 'for' will do exactly
the right thing, and the other 1% of the time (when you do want to
re-use an iterator), you already *know* you're doing something clever.
So the cognitive overhead on each for-loop is really low.
- for one-off scripts: ~99% of the time (actual measurement, see
below) everything just works, if anything a little better than
before. 1% of the time, you deploy the clever trick of re-using an
iterator across multiple for loops, and it breaks, so this is some
pain. Here's what you see:

    gen_obj = ...
    for first_line in gen_obj:
        break
    for lines in gen_obj:
        ...

    Traceback (most recent call last):
      File "/tmp/foo.py", line 5, in <module>
        for lines in gen_obj:
    AlreadyClosedIteratorError: this iterator was already closed,
possibly by a previous 'for' loop. (Maybe you want
itertools.preserve?)

(We could even have a PYTHONDEBUG flag that, when enabled, makes that
error message include the file:line of the previous 'for' loop that
called __iterclose__.)
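
And the fix, once you've seen the error, is mechanical (a sketch --
itertools.preserve is part of the proposal, not current Python):

    gen_obj = ...
    # preserve() opts this loop out of the automatic __iterclose__
    # call, so the generator survives the 'break':
    for first_line in itertools.preserve(gen_obj):
        break
    for lines in gen_obj:
        ...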

So this is pain! But the pain is (a) rare, not pervasive, (b)
immediately obvious (an exception, the code doesn't work at all), not
subtle and delayed, (c) easily googleable, (d) easy to fix and the fix
is reliable. It's a totally different type of pain than the pain that
we currently impose on folks who want to write robust code.

Now compare to the new proposal (function-scoped-iterclose):

- For those who want robust cleanup: Usually, I only need an iterator
for as long as I'm iterating over it; that may or may not correspond
to the end of the function (often won't). When these don't coincide,
it can cause problems. E.g., consider the original example from my
proposal:

  import json

  def read_newline_separated_json(path):
      with open(path) as f:
          for line in f:
              yield json.loads(line)

but now suppose that I'm a Data Scientist (tm) so instead of having 1
file full of newline-separated JSON, I have 100 gigabytes' worth of
the stuff stored in lots of files in a directory tree. Well, that's no
problem, I'll just wrap that generator:

  import os
  from os.path import join

  def read_newline_separated_json_tree(tree):
      for root, _, paths in os.walk(tree):
          for path in paths:
              for document in read_newline_separated_json(join(root, path)):
                  yield document

And then I'll run it on PyPy, because that's what you do when you have
100 GB of string processing, and... it'll crash, because the call to
read_newline_separated_json_tree ends up making thousands of calls to
read_newline_separated_json, but never cleans any of them up until
the function exits, so eventually we run out of file descriptors.
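
The fix -- needed today under the status quo, and still needed under
function-scoped cleanup -- is to close each inner generator
explicitly as soon as we're done with it, e.g. with
contextlib.closing:

  from contextlib import closing

  def read_newline_separated_json_tree(tree):
      for root, _, paths in os.walk(tree):
          for path in paths:
              # close each inner generator before moving on, instead
              # of letting thousands pile up on this frame
              with closing(read_newline_separated_json(join(root, path))) as docs:
                  for document in docs:
                      yield document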

A similar situation arises in the main loop of something like an HTTP server:

  while True:
      request = read_request(sock)
      for response_chunk in application_handler(request):
          send_response_chunk(sock, response_chunk)

Here we'll accumulate arbitrary numbers of un-closed
application_handler generators attached to the stack frame, which is
no good at all. And this has the interesting failure mode that you'll
probably miss in testing, because most clients will only re-use a
connection a small number of times.

So what this means is that every time I write a for loop, I can't just
do a quick "am I going to break out of the for-loop and then re-use
this iterator?" check -- I have to stop and think about whether this
for-loop is nested inside some other loop, etc. And, again, if I get
it wrong, then it's a subtle bug that will bite me later. It's true
that with the status quo, we need to wrap X% of for-loops with 'with'
blocks, and with this proposal that number would drop to, I don't
know, (X/5)% or something. But that's not the most important cost: the
most important cost is the cognitive overhead of figuring out which
for-loops need the special treatment, and in this proposal that
checking is actually *more* complicated than the status quo.

- For those who just want to write a quick script and not think about
it: here's a script that does repeated partial for-loops over a
generator object:

    https://github.com/python/cpython/blob/553a84c4c9d6476518e2319acda6ba29b8588cb4/Tools/scripts/gprof2html.py#L40-L79

(and note that the generator object even has an ineffective 'with
open(...)' block inside it!)

With function-scoped-iterclose, this script would continue to work
as it does now. Excellent.

But suppose I decide that the main() function is really complicated
and that it would be better to refactor some of those loops out into
helper functions. (Probably actually true in this example.) So I do
that and... suddenly the code breaks. And in a rather confusing way,
because it has to do with a complicated long-distance interaction
between two different 'for' loops *and* where they're placed with
respect to the original function versus the helper function.
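
Concretely, the breakage looks something like this (a sketch, with a
made-up helper name):

  def read_header(gen_obj):
      # under function-scoped-iterclose, gen_obj is registered on
      # *this* frame, so it's closed as soon as read_header returns
      for line in gen_obj:
          return line

  gen_obj = ...
  header = read_header(gen_obj)
  for line in gen_obj:
      # back in the caller, the iterator is already closed, so this
      # loop silently processes nothing
      ...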

If I were an intermediate-level Python student (and I'm pretty sure
anyone who is starting to get clever with re-using iterators counts as
"intermediate level"), I'd actually prefer the immediate, obvious
feedback from for-scoped-iterclose. This would also be a good time to
teach folks about this aspect of resource handling -- it's certainly
an important thing to learn eventually on your way to Python mastery,
even if it isn't needed for every script.

In the pypy-dev thread about this proposal, there are some very
distressed emails from someone who's been writing Python for a long
time but only just realized that generator cleanup relies on the
garbage collector:

    https://mail.python.org/pipermail/pypy-dev/2016-October/014709.html
    https://mail.python.org/pipermail/pypy-dev/2016-October/014720.html

It's unpleasant to have the rug pulled out from under you like this
and suddenly realize that you might have to go re-evaluate all the
code you've ever written, and making for loops safe-by-default and
fail-fast-when-unsafe avoids that.

Anyway, in summary: function-scoped-iterclose doesn't seem to
accomplish my goal of getting rid of the *type* of pain involved when
you have to run a background thread in your brain that's doing
constant paranoid checking every time you write a for loop. Instead it
arguably takes that type of pain and spreads it around both the
experts and the novices :-/.

-------------

Now, let's look at some evidence about how disruptive the two
proposals are for real code:

As mentioned else-thread, I wrote a stupid little CPython hack [1] to
report when the same iterator object gets passed to multiple 'for'
loops, and ran the CPython and Django testsuites with it [2]. Looking
just at generator objects [3], across these two large codebases there
are exactly 4 places where this happens. (Rough idea of prevalence:
these 4 places together account for 8 'for' loops, out of 11,503
'for' loops total, of which 665 involve generator objects.) The 4
places are:

1) CPython's Lib/test/test_collections.py:1135, Lib/_collections_abc.py:378

This appears to be a bug in the CPython test suite -- the little MySet
class does 'def __init__(self, itr): self.contents = itr', which
assumes that itr is a container that can be repeatedly iterated. But a
bunch of the methods on collections.abc.Set like to pass in a
generator object here instead, which breaks everything. If repeated
'for' loops on generators raised an error then this bug would have
been caught much sooner.

2) CPython's Tools/scripts/gprof2html.py lines 45, 54, 59, 75

Discussed above -- as written, for-scoped-iterclose would break this
script, but function-scoped-iterclose would not, so here
function-scoped-iterclose wins.

3) Django django/utils/regex_helper.py:236

This code is very similar to the previous example in its general
outline, except that the 'for' loops *have* been factored out into
utility functions. So in this case for-scoped-iterclose and
function-scoped-iterclose are equally disruptive.

4) CPython's Lib/test/test_generators.py:723

I have to admit I cannot figure out what this code is doing, besides
showing off :-). But the different 'for' loops are in different stack
frames, so I'm pretty sure that for-scoped-iterclose and
function-scoped-iterclose would be equally disruptive.

Obviously there's a bias here in that these are still relatively
"serious" libraries; I don't have a big corpus of one-off scripts that
are just a big __main__, though gprof2html.py isn't far from that. (If
anyone knows where to find such a thing let me know...) But still, the
tally here is that out of 4 examples, we have 1 subtle bug that
iterclose might have caught, 2 cases where for-scoped-iterclose and
function-scoped-iterclose are equally disruptive, and only 1 where
function-scoped-iterclose is less disruptive -- and in that case it's
arguably just avoiding an obvious error now in favor of a more
confusing error later.

If this reduced the backwards-incompatible cases by a factor of, like,
10x or 100x, then that would be a pretty strong argument in its favor.
But it seems to be more like... 1.5x.

-n

[1] https://github.com/njsmith/cpython/commit/2b9d60e1c1b89f0f1ac30cbf0a5dceee835142c2
[2] CPython: revision b0a272709b from the github mirror; Django:
revision 90c3b11e87
[3] I also looked at "all iterators" and "all iterators with .close
methods", but this email is long enough... basically the pattern is
the same: there are another 13 'for' loops that involve repeated
iteration over non-generator objects, and they're roughly equally
split between spurious effects due to bugs in the CPython test-suite
or my instrumentation, cases where for-scoped-iterclose and
function-scoped-iterclose both cause the same problems, and cases
where function-scoped-iterclose is less disruptive.


-- 
Nathaniel J. Smith -- https://vorpus.org

