[Python-ideas] Deterministic iterator cleanup

Nathaniel Smith njs at pobox.com
Wed Oct 19 14:11:53 EDT 2016


On Wed, Oct 19, 2016 at 10:08 AM, Neil Girdhar <mistersheik at gmail.com> wrote:
>
>
> On Wed, Oct 19, 2016 at 11:08 AM Todd <toddrjen at gmail.com> wrote:
>>
>> On Wed, Oct 19, 2016 at 3:38 AM, Neil Girdhar <mistersheik at gmail.com>
>> wrote:
>>>
>>> This is a very interesting proposal.  I just wanted to share something I
>>> found in my quick search:
>>>
>>>
>>> http://stackoverflow.com/questions/14797930/python-custom-iterator-close-a-file-on-stopiteration
>>>
>>> Could you explain why the accepted answer there doesn't address this
>>> issue?
>>>
>>> class Parse(object):
>>>     """A generator that iterates through a file"""
>>>     def __init__(self, path):
>>>         self.path = path
>>>
>>>   def __iter__(self):
>>>         with open(self.path) as f:
>>>             yield from f

BTW it may make this easier to read if we notice that it's essentially
a verbose way of writing:

def parse(path):
    with open(path) as f:
        yield from f

>>
>> I think the difference is that this new approach guarantees cleanup the
>> exact moment the loop ends, no matter how it ends.
>>
>> If I understand correctly, your approach will do cleanup when the loop
>> ends only if the iterator is exhausted.  But if someone zips it with a
>> shorter iterator, uses itertools.islice or something similar, breaks the
>> loop, returns inside the loop, or in some other way ends the loop before the
>> iterator is exhausted, the cleanup won't happen when the iterator is garbage
>> collected.  And for non-reference-counting python implementations, when this
>> happens is completely unpredictable.
>>
>> --
>
>
> I don't see that.  The "cleanup" will happen when collection is interrupted
> by an exception.  This has nothing to do with garbage collection either
> since the cleanup happens deterministically when the block is ended.  If
> this is the only example, then I would say this behavior is already provided
> and does not need to be added.

I think there might be a misunderstanding here. Consider code like
this, that breaks out from the middle of the for loop:

def use_that_generator():
    for line in parse(...):
        if found_the_line_we_want(line):
            break
    # -- mark --
    do_something_with_that_line(line)

With current Python, what will happen is that when we reach the marked
line, then the for loop has finished and will drop its reference to
the generator object. At this point, the garbage collector comes into
play. On CPython, with its reference counting collector, the garbage
collector will immediately collect the generator object, and then the
generator object's __del__ method will restart 'parse' by having the
last 'yield' raise a GeneratorExit, and *that* exception will trigger
the 'with' block's cleanup. But in order to get there, we're
absolutely depending on the garbage collector to inject that
GeneratorExit. And on an implementation like PyPy that doesn't use
reference counting, the generator object will become collect*ible* at
the marked line, but might not actually be collect*ed* for an
arbitrarily long time afterwards. And until it's collected, the file
will remain open. 'with' blocks guarantee that the resources they hold
will be cleaned up promptly when the enclosing stack frame gets
cleaned up, but for a 'with' block inside a generator then you still
need something to guarantee that the enclosing stack frame gets
cleaned up promptly!

This proposal is about providing that thing -- with __(a)iterclose__,
the end of the for loop immediately closes the generator object, so
the garbage collector doesn't need to get involved.

Essentially the same thing happens if we replace the 'break' with a
'raise'. Though with exceptions, things can actually get even messier,
even on CPython. Here's a similar example except that (a) it exits
early due to an exception (which then gets caught elsewhere), and (b)
the invocation of the generator function ended up being kind of long,
so I split the for loop into two lines with a temporary variable:

def use_that_generator2():
    it = parse("/a/really/really/really/really/really/really/really/long/path")
    for line in it:
        if not valid_format(line):
            raise ValueError()

def catch_the_exception():
    try:
        use_that_generator2()
    except ValueError:
        # -- mark --
        ...

Here the ValueError() is raised from use_that_generator2(), and then
caught in catch_the_exception(). At the marked line,
use_that_generator2's stack frame is still pinned in memory by the
exception's traceback. And that means that all the local variables are
also pinned in memory, including our temporary 'it'. Which means that
parse's stack frame is also pinned in memory, and the file is not
closed.

With the __(a)iterclose__ proposal, when the exception is thrown then
the 'for' loop in use_that_generator2() immediately closes the
generator object, which in turn triggers parse's 'with' block, and
that closes the file handle. And then after the file handle is closed,
the exception continues propagating. So at the marked line, it's still
the case that 'it' will be pinned in memory, but now 'it' is a closed
generator object that has already relinquished its resources.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org


More information about the Python-ideas mailing list