[pypy-dev] RFC: draft idea for making for loops automatically close iterators

Nathaniel Smith njs at pobox.com
Mon Oct 17 04:08:37 EDT 2016


Hi all,

I've been poking at an idea for changing how 'for' loops work to
hopefully make them work better for pypy and async/await code. I
haven't taken it to python-ideas yet -- this is its first public
outing, actually -- but since it directly addresses pypy GC issues I
thought I'd send around a draft to see what you think. (E.g., would
this be something that makes your life easier?)

Thanks,
-n

----

Abstract
========

We propose to extend the iterator protocol with a new
``__(a)iterclose__`` slot, which is called automatically on exit from
``(async) for`` loops, regardless of how they exit. This allows for
convenient, deterministic cleanup of resources held by iterators
without reliance on the garbage collector. This is especially
important (and urgent) for asynchronous generators.


Note
====

In practical terms, the proposal here is divided into two separate
parts: the handling of async iterators, which should ideally be
implemented ASAP, and the handling of regular iterators, which is a
larger but more relaxed project that won't even start until 3.7. But
since the changes are closely related, and we probably don't want to
end up with async iterators and regular iterators diverging in the
long run, it seems useful to look at them together.


Background and motivation
=========================

Python iterables often hold resources which require cleanup. For
example: ``file`` objects need to be closed; the `WSGI spec
<https://www.python.org/dev/peps/pep-0333/>`_ adds a ``close`` method
on top of the regular iterator protocol and demands that consumers
call it at the appropriate time (though forgetting to do so is a
`frequent source of bugs
<http://blog.dscpl.com.au/2012/10/obligations-for-calling-close-on.html>`_);
and PEP 342 (based on PEP 325) extended generator objects to add a
``close`` method to allow generators to clean up after themselves.

Generally, objects that need to clean up after themselves define a
``__del__`` method to ensure that this cleanup will happen eventually,
when the object is garbage collected. However, relying on the garbage
collector for cleanup like this causes serious problems in at least
two cases:

- In Python implementations that do not use reference counting (e.g.
PyPy, Jython), calls to ``__del__`` may be arbitrarily delayed, yet
many situations depend on *prompt* cleanup of resources. Delayed
cleanup produces problems like crashes due to file descriptor
exhaustion, or WSGI timing middleware that collects bogus times.

- Async generators (PEP 525) can only perform cleanup under the
supervision of the appropriate coroutine runner. ``__del__`` doesn't
have access to the coroutine runner; indeed, the coroutine runner
might be garbage collected before the generator object. So relying on
the garbage collector is effectively impossible without some kind of
language extension. (PEP 525 does provide such an extension, but it
has serious flaws; see the "rejected alternatives" section below.)

The usual recommendation, therefore, is to avoid the garbage collector
by using ``with`` blocks. For example, this code opens a file but
relies on the garbage collector to close it::

  def read_newline_separated_json(path):
      for line in open(path):
          yield json.loads(line)

  for document in read_newline_separated_json(path):
      ...

and recent versions of CPython will point this out by issuing a
``ResourceWarning``, nudging us to fix it by adding a ``with`` block::

  def read_newline_separated_json(path):
      with open(path) as file_handle:      # <-- new
          for line in file_handle:
              yield json.loads(line)

  for document in read_newline_separated_json(path):  # <-- outer for loop
      ...

But there's a subtlety here, caused by the interaction of ``with``
blocks and generators. ``with`` blocks are Python's main tool for
ensuring cleanup, and they're a powerful one, because they pin the
lifetime of a resource to the lifetime of a stack frame. But in the
case of generators, we need a ``with`` block to ensure that the stack
frame is cleaned up!

In this case, adding the ``with`` block *is* enough to shut up the
``ResourceWarning``, but this is misleading -- the file object cleanup
here is still dependent on the garbage collector. The ``with`` block
will only be unwound when the ``read_newline_separated_json``
generator is closed. If the outer ``for`` loop runs to completion then
the cleanup will happen immediately; but if this loop is terminated
early by a ``break`` or an exception, then the ``with`` block won't
fire until the generator object is garbage collected.
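
For example (a minimal sketch reusing the generator above; ``target_id``
is just an illustrative condition for exiting early)::

  for document in read_newline_separated_json(path):
      if document["id"] == target_id:  # hypothetical early exit
          break
  # The generator is now suspended inside its ``with`` block. On CPython,
  # reference counting closes the file promptly; on PyPy or Jython, it
  # stays open until some later GC cycle finalizes the generator.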

The correct solution requires that all *users* of this API wrap every
``for`` loop in its own ``with`` block::

  with closing(read_newline_separated_json(path)) as genobj:
      for document in genobj:
          ...

This gets even worse if we consider the idiom of decomposing a complex
pipeline into multiple nested generators::

  def read_users(path):
      with closing(read_newline_separated_json(path)) as gen:
          for document in gen:
              yield User.from_json(document)

  def users_in_group(path, group):
      with closing(read_users(path)) as gen:
          for user in gen:
              if user.group == group:
                  yield user

In general if you have N nested generators then you need N+1 ``with``
blocks to clean up 1 file. And good defensive programming would
suggest that any time we use a generator, we should assume the
possibility that there could be at least one ``with`` block somewhere
in its (potentially transitive) call stack, either now or in the
future, and thus always wrap it in a ``with``. But in practice,
basically nobody does this, because it's awful and programmers would
rather write buggy code than tiresome repetitive code.

Is this worth fixing? Previously I would have argued yes, but that it
was a low priority -- until the advent of async generators, which
makes this problem much more urgent. Async generators cannot do
cleanup *at all* without some mechanism for deterministic cleanup that
people will actually use, and async generators are particularly likely
to hold resources like file descriptors. (After all, if they weren't
doing I/O, they'd be generators, not async generators.) And if we
don't get it right now when async generators are first rolling out,
then it'll be much harder to fix later.

The proposal itself is simple in concept: add a ``__(a)iterclose__``
method to the iterator protocol, and have (async) ``for`` loops call
it when the loop is exited. Effectively, we're taking the current
cumbersome idiom (``with`` block + ``for`` loop) and merging them
together into a fancier ``for``.


Rejected alternatives
=====================

PEP 525 asyncgen hooks
----------------------

PEP 525 proposes a `set of global hooks managed by new
``sys.{get/set}_asyncgen_hooks()`` functions
<https://www.python.org/dev/peps/pep-0525/#finalization>`_, which
event loops are intended to register to take control of async
generator finalization. The stated goal is that "the end user does not
need to care about the finalization problem, and everything just
works". Unfortunately, though, the approach has a number of downsides:

- It adds substantial complexity: we have new global interpreter
state, and new public API in asyncio (``loop.shutdown_asyncgens()``)
that users have to remember to call at the appropriate time.

- The ``firstiter`` hook has to be able to uniquely identify what
coroutine runner is being used at any given moment, and the
``finalizer`` hook has to be able to take a generator object and
figure out which coroutine runner was supervising it initially. These
requirements introduce surprisingly complicated couplings and
potential constraints on future designs.

  For example, one might plausibly want to start several OS threads,
and run a separate asyncio event loop in each --
``asyncio.BaseEventLoopPolicy`` takes some trouble to support exactly
this use case. But once there are multiple event loops running
simultaneously, the hooks have the problem of somehow matching up each
generator to its corresponding event loop. For ``firstiter`` this
isn't so bad -- we can assume that the thread where ``firstiter`` is
called matches the thread whose event loop we want -- but
``finalizer`` is trickier, since the generator might be collected in a
different thread than it started in.

  The code currently in the asyncio master branch doesn't consider
this situation at all. If you try, what will happen is that whichever
event loop starts up last will run the finalizers for all threads,
which will probably blow up spectacularly. The current implementation
is also broken if the following sequence of events occurs:

  1. start a loop
  2. firstiter(agen) invoked
  3. stop the loop, but forget to call ``loop.shutdown_asyncgens()``.
(NB: No existing asyncio programs call ``loop.shutdown_asyncgens()``,
and it's never called automatically.)
  4. create a new loop
  5. finalizer(agen) invoked -- now the new loop will happily attempt
to execute agen.aclose()

  These issues with the current implementation are fixable (XX FIXME
file bugs), but they give a sense of how tricky this API is.

  It gets worse: suppose I want to run an asyncio event loop in one
thread and a twisted reactor loop in another (e.g., to take advantage
of twisted functionality that hasn't yet been ported to run on top of
asyncio, with communication between the threads using
``call_soon_threadsafe`` / ``callFromThread``). Now the two event
loops have to fight over the hooks.

  Curio currently doesn't even have the concept of a global event loop.

  A more obscure case arises with libraries like `async_generator
<https://pypi.python.org/pypi/async_generator>`_, which runs code
under a "proxy" coroutine runner that handles some yields itself while
forwarding others on to the real event loop. Here it is the *inner*
coroutine runner that should be used for calling ``aclose``, not the
outer one, but there is no way for the hooks to know this. Admittedly
this isn't a problem for async_generator itself, since it's obsoleted by
PEP 525, and it's not clear whether this technique has other use cases.
But on the other hand, maybe we should try to keep
our options open; we have so little experience with async/await that
it's hard to say what clever tricks will turn out to be important.

  Basically the point is, these hooks have extremely delicate
semantics and it's not at all clear that we know how to deal with all
the situations they cause.

- The new semantics aren't part of the abstract async iterator
protocol, but are instead tied `specifically to the async generator
concrete type <XX find and link Yury's email saying this>`_. If you
have an async iterator implemented using a class, like::

    class MyAIterator:
        def __anext__():
            ...

  then you can't refactor this into an async generator without
changing the semantics, and vice-versa. This seems very unpythonic.
(It also leaves open the question of what exactly class-based async
iterators are supposed to do, given that they face exactly the
same cleanup problems as async generators.)

And then assuming we manage to avoid the problems above, the best-case
payoff is that we get GC semantics for async generators. So after all
that it's still effectively a CPython-only feature (!!), and even
there it has poor ergonomics, e.g., if ``aclose`` raises an error then
it will get lost. In practice, code that wants to be portable across
Python implementations or handle exceptions reliably will still have
to write things like::

    with aclosing(get_newline_separated_json(url)) as agen:
        async for document in agen:
            ...

just like it would if the asyncgen hooks didn't exist.

By comparison, the present proposal is straightforward and
understandable, requires no global state or global coordination
between coroutine runners, works equally well for generators and other
iterators, works on PyPy, gives properly propagating exceptions by
default, etc.
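
Concretely, under this proposal the ``aclosing`` boilerplate shown above
is no longer needed; the same user code can be written as just::

  async for document in get_newline_separated_json(url):
      ...

with cleanup guaranteed by ``__aiterclose__`` on every exit path.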


Always inject resources, and do all cleanup at the top level
------------------------------------------------------------

It was suggested on python-dev (XX find link) that a pattern to avoid
these problems is to always pass resources in from above, e.g.
``read_newline_separated_json`` should take a file object rather than
a path, with cleanup handled at the top level::

  def read_newline_separated_json(file_handle):
      for line in file_handle:
          yield json.loads(line)

  def read_users(file_handle):
      for document in read_newline_separated_json(file_handle):
          yield User.from_json(document)

  with open(path) as file_handle:
      for user in read_users(file_handle):
          ...

This works well in simple cases; here it lets us avoid the "N+1
problem". But unfortunately, it breaks down quickly when things get
more complex. Consider if instead of reading from a file, our
generator was processing the body returned by an HTTP GET request --
while handling redirects and authentication via OAUTH. Then we'd
really want the sockets to be managed down inside our HTTP client
library, not at the top level. Plus there are other cases where
``finally`` blocks embedded inside generators are important in their
own right: db transaction management, emitting logging information
during cleanup (one of the major motivating use cases for WSGI
``close``), and so forth.


Specification: final state
==========================

This section describes where we want to eventually end up, though
there are some backwards compatibility issues that mean we can't jump
directly here. A later section describes the transition plan.


Guiding principles
------------------

Generally, ``__(a)iterclose__`` implementations should:

- be idempotent,
- perform any cleanup that is appropriate on the assumption that the
iterator will not be used again after ``__(a)iterclose__`` is called.
In particular, once ``__(a)iterclose__`` has been called then calling
``__(a)next__`` produces undefined behavior.

And generally, any code which starts iterating through an iterable
with the intention of exhausting it, should arrange to make sure that
``__(a)iterclose__`` is eventually called, whether or not the iterator
is actually exhausted.
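
As a minimal illustration of these principles (the class here is just an
example, not part of the proposal)::

  class LineIterator:
      """Iterates over lines of a file, cleaning up via __iterclose__."""
      def __init__(self, path):
          self._file = open(path)
          self._closed = False

      def __iter__(self):
          return self

      def __next__(self):
          # After __iterclose__, behavior is undefined; here the closed
          # file will simply raise.
          line = self._file.readline()
          if not line:
              raise StopIteration
          return line

      def __iterclose__(self):
          # Idempotent: calling this more than once is harmless.
          if not self._closed:
              self._closed = True
              self._file.close()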


Changes to iteration
--------------------

The core proposal is the change in behavior of ``for`` loops. Given
this Python code::

  for VAR in ITERABLE:
      LOOP-BODY
  else:
      ELSE-BODY

we desugar to the equivalent of::

  _iter = iter(ITERABLE)
  _iterclose = getattr(type(_iter), "__iterclose__", lambda it: None)
  try:
      traditional-for VAR in _iter:
          LOOP-BODY
      else:
          ELSE-BODY
  finally:
      _iterclose(_iter)

where the "traditional-for statement" here is meant as a shorthand for
the classic 3.5-and-earlier ``for`` loop semantics.

Besides the top-level ``for`` statement, Python also contains several
other places where iterators are consumed. For consistency, these
should call ``__iterclose__`` as well using semantics equivalent to
the above. This includes:

- ``for`` loops inside comprehensions
- ``*`` unpacking
- functions which accept and fully consume iterables, like
``list(it)``, ``tuple(it)``, ``itertools.product(it1, it2, ...)``, and
others.
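
For example, a pure-Python sketch of how ``list`` might consume its
argument under this proposal (the real implementation is C code; this
sketch uses the ``operator.iterclose`` helper defined below)::

  def list_(iterable):
      _it = iter(iterable)
      result = []
      try:
          while True:
              try:
                  result.append(next(_it))
              except StopIteration:
                  break
      finally:
          # Called whether the iterator was exhausted or next() raised.
          operator.iterclose(_it)
      return result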


Changes to async iteration
--------------------------

We also make the analogous changes to async iteration constructs,
except that the new slot is called ``__aiterclose__``, and it's an
async method that gets ``await``\ed.
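
In the same pseudo-code style as above, ``async for`` would desugar to
the equivalent of::

  _ait = ITERABLE.__aiter__()
  _aiterclose = getattr(type(_ait), "__aiterclose__", None)
  try:
      traditional-async-for VAR in _ait:
          LOOP-BODY
      else:
          ELSE-BODY
  finally:
      if _aiterclose is not None:
          await _aiterclose(_ait)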


Modifications to basic iterator types
-------------------------------------

Generator objects (including those created by generator comprehensions):
- ``__iterclose__`` calls ``self.close()``
- ``__del__`` calls ``self.close()`` (same as now), and additionally
issues a ``ResourceWarning`` if ``close`` has not been called. This
warning is hidden by default, but can be enabled for those who want to
make sure they aren't inadvertently relying on CPython-specific GC
semantics.

Async generator objects (including those created by async generator
comprehensions):
- ``__aiterclose__`` calls ``self.aclose()``
- ``__del__`` issues a ``RuntimeWarning`` if ``aclose`` has not been
called, since this probably indicates a latent bug, similar to the
"coroutine never awaited" warning.

OPEN QUESTION: should file objects implement ``__iterclose__`` to
close the file? On the one hand this would make this change more
disruptive; on the other hand people really like writing ``for line in
open(...): ...``, and if we get used to iterators taking care of their
own cleanup then it might become very weird if files don't.


New convenience functions
-------------------------

The ``itertools`` module gains a new iterator wrapper that can be used
to selectively disable the new ``__iterclose__`` behavior::

  # XX FIXME: I feel like there might be a better name for this one?
  class protect:
      def __init__(self, iterable):
          self._it = iter(iterable)

      def __iter__(self):
          return self

      def __next__(self):
          return next(self._it)

      def __iterclose__(self):
          # Swallow __iterclose__ without passing it on
          pass

Example usage (assuming that file objects implement ``__iterclose__``)::

  with open(...) as handle:
      # Iterate through the same file twice:
      for line in itertools.protect(handle):
          ...
      handle.seek(0)
      for line in itertools.protect(handle):
          ...

The ``operator`` module gains two new functions, with semantics
equivalent to the following::

  def iterclose(it):
      if hasattr(type(it), "__iterclose__"):
          type(it).__iterclose__(it)

  async def aiterclose(ait):
      if hasattr(type(ait), "__aiterclose__"):
          await type(ait).__aiterclose__(ait)

These are particularly useful when implementing the changes in the next section:


__iterclose__ implementations for iterator wrappers
---------------------------------------------------

Python ships a number of iterator types that act as wrappers around
other iterators: ``map``, ``zip``, ``itertools.accumulate``,
``csv.reader``, and others. These iterators should define a
``__iterclose__`` method which calls ``__iterclose__`` in turn on
their underlying iterators. For example, ``map`` could be implemented
as::

  class map:
      def __init__(self, fn, *iterables):
          self._fn = fn
          self._iters = [iter(iterable) for iterable in iterables]

      def __iter__(self):
          return self

      def __next__(self):
          return self._fn(*[next(it) for it in self._iters])

      def __iterclose__(self):
          for it in self._iters:
              operator.iterclose(it)

In some cases this requires some subtlety; for example, `itertools.tee
<https://docs.python.org/3/library/itertools.html#itertools.tee>`_
should not call ``__iterclose__`` on the underlying iterator until it
has been called on *all* of the clone iterators.
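
One way to get that right is a shared count of still-open clones; a
rough sketch (which, unlike the real ``tee``, omits the buffering that
lets clones lag behind each other, and reuses the ``operator.iterclose``
helper from above)::

  class _TeeClone:
      def __init__(self, shared):
          self._shared = shared   # {"it": underlying iterator, "open": count}
          self._closed = False

      def __iter__(self):
          return self

      def __next__(self):
          return next(self._shared["it"])

      def __iterclose__(self):
          if not self._closed:
              self._closed = True
              self._shared["open"] -= 1
              if self._shared["open"] == 0:
                  # Last clone closed: now close the underlying iterator.
                  operator.iterclose(self._shared["it"])

  def tee(iterable, n=2):
      shared = {"it": iter(iterable), "open": n}
      return tuple(_TeeClone(shared) for _ in range(n))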


Example / Rationale
-------------------

The payoff for all this is that we can now write straightforward code like::

  def read_newline_separated_json(path):
      for line in open(path):
          yield json.loads(line)

and be confident that the file will receive deterministic cleanup
*without the end-user having to take any special effort*, even in
complex cases. For example, consider this silly pipeline::

  list(map(lambda key: key.upper(),
           (doc["key"] for doc in read_newline_separated_json(path))))

If our file contains a document where ``doc["key"]`` turns out to be
an integer, then the following sequence of events will happen:

1. ``key.upper()`` raises an ``AttributeError``, which propagates out
of the ``map`` and triggers the implicit ``finally`` block inside
``list``.
2. The ``finally`` block in ``list`` calls ``__iterclose__()`` on the
map object.
3. ``map.__iterclose__()`` calls ``__iterclose__()`` on the generator
comprehension object.
4. This injects a ``GeneratorExit`` exception into the generator
comprehension body, which is currently suspended inside the
comprehension's ``for`` loop body.
5. The exception propagates out of the ``for`` loop, triggering the
``for`` loop's implicit ``finally`` block, which calls
``__iterclose__`` on the generator object representing the call to
``read_newline_separated_json``.
6. This injects an inner ``GeneratorExit`` exception into the body of
``read_newline_separated_json``, currently suspended at the ``yield``.
7. The inner ``GeneratorExit`` propagates out of the ``for`` loop,
triggering the ``for`` loop's implicit ``finally`` block, which calls
``__iterclose__()`` on the file object.
8. The file object is closed.
9. The inner ``GeneratorExit`` resumes propagating, hits the boundary
of the generator function, and causes
``read_newline_separated_json``'s ``__iterclose__()`` method to return
successfully.
10. Control returns to the generator comprehension body, and the outer
``GeneratorExit`` continues propagating, allowing the comprehension's
``__iterclose__()`` to return successfully.
11. The rest of the ``__iterclose__()`` calls unwind without incident,
back into the body of ``list``.
12. The original ``AttributeError`` resumes propagating.

(The details above assume that we implement ``file.__iterclose__``; if
not then add a ``with`` block to ``read_newline_separated_json`` and
essentially the same logic goes through.)

Of course, from the user's point of view, this can be simplified down to just:

1. ``key.upper()`` raises an ``AttributeError``.
2. The file object is closed.
3. The ``AttributeError`` propagates out of ``list``.

So we've accomplished our goal of making this "just work" without the
user having to think about it.


Specification: how to get there from here
=========================================

While the majority of existing ``for`` loops will continue to produce
identical results, the proposed changes will produce
backwards-incompatible behavior in some cases. Example::

  def read_csv_with_header(lines_iterable):
      lines_iterator = iter(lines_iterable)
      # Used to be correct; now needs an itertools.protect() here:
      for line in lines_iterator:
          column_names = line.strip().split("\t")
          break
      for line in lines_iterator:
          values = line.strip().split("\t")
          record = dict(zip(column_names, values))
          yield record
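
With the fix suggested in the comment, the function would look like this
under the new semantics (a sketch using the proposed
``itertools.protect``)::

  def read_csv_with_header(lines_iterable):
      lines_iterator = iter(lines_iterable)
      # protect() swallows __iterclose__, so breaking out of this loop
      # no longer closes lines_iterator out from under the second loop.
      for line in itertools.protect(lines_iterator):
          column_names = line.strip().split("\t")
          break
      for line in lines_iterator:
          values = line.strip().split("\t")
          record = dict(zip(column_names, values))
          yield record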

Specifically, the incompatibility happens when all of these factors
come together:

- The automatic calling of ``__(a)iterclose__`` is enabled
- The iterable did not previously define ``__(a)iterclose__``
- The iterable does now define ``__(a)iterclose__``
- The iterable is re-used after the ``for`` loop exits

So the problem is how to manage this transition, and those are the
levers we have to work with.

First, observe that the only async iterables where we propose to add
``__aiterclose__`` are async generators, and there is currently no
existing code using async generators (though this will start changing
very soon), so the async changes do not produce any backwards
incompatibilities. (There is existing code using async iterators, but
using the new async for loop on an old async iterator is harmless,
because old async iterators don't have ``__aiterclose__``.) In
addition, PEP 525 was accepted on a provisional basis, and async
generators are by far the biggest beneficiary of this PEP's proposed
changes. Therefore, I think we should strongly consider enabling
``__aiterclose__`` for ``async for`` loops and async generators ASAP,
ideally for 3.6.0 or 3.6.1.

For the non-async world, things are harder, but here's a potential
transition path:

In 3.7:

Our goal is that existing unsafe code will start emitting warnings,
while those who want to opt-in to the future can do that immediately:

- We immediately add all the ``__iterclose__`` methods described above.
- If ``from __future__ import iterclose`` is in effect, then ``for``
loops and ``*`` unpacking call ``__iterclose__`` as specified above.
- If the future is *not* enabled, then ``for`` loops and ``*``
unpacking do *not* call ``__iterclose__``. But they do call some other
method instead, e.g. ``__iterclose_warning__``.
- Similarly, functions like ``list`` use stack introspection (!!) to
check whether their direct caller has ``__future__.iterclose``
enabled, and use this to decide whether to call ``__iterclose__`` or
``__iterclose_warning__``.
- For all the wrapper iterators, we also add ``__iterclose_warning__``
methods that forward to the ``__iterclose_warning__`` method of the
underlying iterator or iterators.
- For generators (and files, if we decide to do that),
``__iterclose_warning__`` is defined to set an internal flag, and
other methods on the object are modified to check for this flag. If
they find the flag set, they issue a ``PendingDeprecationWarning`` to
inform the user that in the future this sequence would have led to a
use-after-close situation and the user should use ``protect()``.
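
To make that concrete, here is a rough sketch of the flag mechanism,
using an illustrative wrapper class (the wrapper itself is not part of
the proposal; ``__iterclose_warning__`` is the method described in the
list above)::

  import warnings

  class _WarnOnUseAfterClose:
      """Illustrative wrapper showing the flag-based warning scheme."""
      def __init__(self, it):
          self._it = it
          self._would_be_closed = False

      def __iter__(self):
          return self

      def __next__(self):
          if self._would_be_closed:
              warnings.warn(
                  "with 'from __future__ import iterclose' this iterator "
                  "would already be closed; wrap it in itertools.protect()",
                  PendingDeprecationWarning, stacklevel=2)
          return next(self._it)

      def __iterclose__(self):
          # Future-enabled callers close for real.
          operator.iterclose(self._it)

      def __iterclose_warning__(self):
          # Legacy callers just set the flag; no actual cleanup happens.
          self._would_be_closed = True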

In 3.8:

- Switch from ``PendingDeprecationWarning`` to ``DeprecationWarning``

In 3.9:

- Enable the future unconditionally and remove all the
``__iterclose_warning__`` stuff.

I believe that this satisfies the normal requirements for this kind of
transition -- opt-in initially, with warnings targeted precisely to
the cases that will be affected, and a long deprecation cycle.

Probably the most controversial / risky part of this is the use of
stack introspection to make the iterable-consuming functions sensitive
to a ``__future__`` setting, though I haven't thought of any situation
where it would actually go wrong yet...


-- 
Nathaniel J. Smith -- https://vorpus.org

