[Python-ideas] Deterministic iterator cleanup

Nathaniel Smith njs at pobox.com
Fri Oct 21 02:37:25 EDT 2016


On Wed, Oct 19, 2016 at 7:07 PM, Terry Reedy <tjreedy at udel.edu> wrote:
> On 10/19/2016 12:38 AM, Nathaniel Smith wrote:
>
>> I'd like to propose that Python's iterator protocol be enhanced to add
>> a first-class notion of completion / cleanup.
>
>
> With respect to the standard iterator protocol, a very solid -1 from me.
> (I leave commenting specifically on __aiterclose__ to Yury.)
>
> 1. I consider the introduction of iterables and the new iterator protocol in
> 2.2 and their gradual replacement of lists in many situations to be the
> greatest enhancement to Python since 1.3 (my first version).  They are, to
> me, one of Python's greatest features, and the minimal nature of the
> protocol is an essential part of what makes them great.

Minimalism for its own sake isn't really a core Python value, and in
any case the minimalism ship has kinda sailed -- we effectively
already have send/throw/close as optional parts of the protocol
(they're most strongly associated with generators, but you're free to
add them to your own iterators and e.g. yield from will happily work
with that). This proposal is basically "we formalize and start
automatically calling the 'close' methods that are already there".
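
To make that concrete: a plain iterator can already grow a close()
method, and yield from will forward a generator's close() to it per
PEP 380. Quick sketch (names made up):

  class CloseableIter:
      # A plain iterator (not a generator) that happens to have close().
      def __init__(self, n):
          self.n = n
          self.closed = False
      def __iter__(self):
          return self
      def __next__(self):
          if self.n <= 0:
              raise StopIteration
          self.n -= 1
          return self.n
      def close(self):
          self.closed = True

  def delegate(it):
      yield from it

  inner = CloseableIter(3)
  gen = delegate(inner)
  next(gen)     # suspend delegate() inside the yield from
  gen.close()   # PEP 380: the close is forwarded to inner.close()
  assert inner.closed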

> 2. I think you greatly underestimate the negative impact, just as we did
> with changing str is bytes to str is unicode.  The change itself, embodied
> in for loops, will break most non-trivial programs.  You yourself note that
> there will have to be pervasive changes in the stdlib just to begin fixing
> the breakage.

The long-ish list of stdlib changes is about enabling the feature
everywhere, not about fixing backwards incompatibilities.

It's an important question, though, which programs will break and how
badly. To try to get a better handle on it, I've been playing a bit
with an instrumented version of CPython that logs whenever the same
iterator is passed to multiple 'for' loops. I'll write up the results
in more detail, but the summary so far is that there seem to be ~8
places in the stdlib that would need preserve() calls added, and ~3 in
Django. Maybe 2-3 hours and 1 hour of work respectively to fix?
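
The check itself is nothing deep -- it's roughly the interpreter-level
equivalent of this pure-Python wrapper (name and details made up):

  import logging

  class ReuseLogger:
      # A 'for' loop starts by calling __iter__, so a second call
      # means the same iterator object has reached a second loop.
      def __init__(self, it):
          self._it = iter(it)
          self._loops = 0
      def __iter__(self):
          self._loops += 1
          if self._loops > 1:
              logging.warning("iterator %r reused by another for loop",
                              self._it)
          return self
      def __next__(self):
          return next(self._it)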

It's not a perfect measure, and the cost certainly isn't zero, but
it's at a completely different order of magnitude than the str
changes. Among other things, this is a transition that allows for
gradual opt-in via a __future__, and fine-grained warnings pointing
you at what you need to fix, neither of which were possible for
str->unicode.

> 3. Though perhaps common for what you do, the need for the change is
> extremely rare in the overall Python world.  Iterators depending on an
> external resource are rare (< 1%, I would think).  Incomplete iteration is
> also rare (also < 1%, I think).  And resources do not always need to be
> released immediately.

This could equally well be an argument that the change is fine -- e.g.
if you're always doing complete iteration, or just iterating over
lists and stuff, then it literally doesn't affect you at all either
way...

> 4. Previous proposals to officially augment the iterator protocol, even with
> optional methods, have been rejected, and I think this one should be too.
>
> a. Add .__len__ as an option.  We added __length_hint__, which an iterator
> may implement, but which is not part of the iterator protocol. It is also
> ignored by bool().
>
> b., c. Add __bool__ and/or peek().  I posted a LookAhead wrapper class that
> implements both for almost any iterable.  I suspect that it is rarely used.
>
>
>>   def read_newline_separated_json(path):
>>       with open(path) as file_handle:      # <-- with block
>>           for line in file_handle:
>>               yield json.loads(line)
>
>
> One problem with passing paths around is that it makes the receiving
> function hard to test.  I think functions should at least optionally take an
> iterable of lines, and make the open part optional.  But then closing should
> also be conditional.

Sure, that's all true, but this is the problem with tiny documentation
examples :-). The point here was to explain the surprising interaction
between generators and with blocks in the simplest way, not to
demonstrate the ideal solution to the problem of reading
newline-separated JSON. Everything you want is still doable in a
post-__iterclose__ world -- in particular, if you do

  for doc in read_newline_separated_json(lines_generator()):
      ...

then both iterators will be closed when the for loop exits. But if you
want to re-use the lines_generator, just write:

  it = lines_generator()
  for doc in read_newline_separated_json(preserve(it)):
      ...
  for more_lines in it:
      ...
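
For concreteness, preserve() could be little more than a pass-through
wrapper whose __iterclose__ does nothing, so the close issued by the
'for' loop never reaches the underlying iterator. The exact spelling
is up for debate; this is just a sketch:

  class preserve:
      # Illustrative only: iterate straight through to the wrapped
      # iterator, but swallow the close a 'for' loop delivers on exit.
      def __init__(self, it):
          self._it = iter(it)
      def __iter__(self):
          return self
      def __next__(self):
          return next(self._it)
      def __iterclose__(self):
          pass  # deliberately do NOT close the underlying iterator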

> If the combination of 'with', 'for', and 'yield' does not work together, then
> do something else, rather than changing the meaning of 'for'. Moving
> responsibility for closing the file from 'with' to 'for', makes 'with'
> pretty useless, while overloading 'for' with something that is rarely
> needed.  This does not strike me as the right solution to the problem.
>
>>   for document in read_newline_separated_json(path):  # <-- outer for loop
>>       ...
>
>
> If the outer loop determines when the file should be closed, then why not
> open it there?  What fails with
>
> try:
>     lines = open(path)
>     gen = read_newline_separated_json(lines)
>     for doc in gen: do_something(doc)
> finally:
>     lines.close()
>     # and/or gen.throw(...) to stop the generator.

Sure, that works in this trivial case, but they aren't all trivial
:-). See the example from my first email about a WSGI-like interface
where response handlers are generators: in that use case, your
suggestion that we avoid all resource management inside generators
would translate to: "webapps can't open files". (Or database
connections, proxy requests, ... or at least, can't hold them open
while streaming out response data.)
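
For instance, a response handler along these lines has to keep the
file open across yields, so its 'with' block can't be hoisted out into
the caller (a toy sketch, not a real WSGI app):

  def serve_big_file(path):
      with open(path, "rb") as f:    # must stay open while we stream
          while True:
              chunk = f.read(65536)
              if not chunk:
                  return
              yield chunk            # server writes this out, then resumes us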

Or sticking to concrete examples, here's a toy-but-plausible generator
where the put-the-with-block-outside strategy seems rather difficult
to implement:

import os

# Yields all lines in all files in 'directory' that contain the
# substring 'needle':
def recursive_grep(directory, needle):
    for dirpath, _, filenames in os.walk(directory):
        for filename in filenames:
            with open(os.path.join(dirpath, filename)) as file_handle:
                for line in file_handle:
                    if needle in line:
                        yield line
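
And under the proposal, an early exit from the caller's loop does the
right thing: it closes the generator, which in turn runs the pending
'with' block and closes whichever file is open at that moment
(arguments made up):

  for line in recursive_grep("/var/log", "error"):
      print(line)
      break   # __iterclose__ -> GeneratorExit -> the open file is closed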

-n

-- 
Nathaniel J. Smith -- https://vorpus.org

