[Python-ideas] Deterministic iterator cleanup
Neil Girdhar
mistersheik at gmail.com
Fri Oct 28 03:11:10 EDT 2016
On Tuesday, October 25, 2016 at 6:26:17 PM UTC-4, Nathaniel Smith wrote:
>
> On Sat, Oct 22, 2016 at 9:02 AM, Nick Coghlan <ncog... at gmail.com> wrote:
> > On 20 October 2016 at 07:02, Nathaniel Smith <n... at pobox.com> wrote:
> >> The first change is to replace the outer for loop with a while/pop
> >> loop, so that if an exception occurs we'll know which iterables remain
> >> to be processed:
> >>
> >> def chain(*iterables):
> >>     try:
> >>         while iterables:
> >>             for element in iterables.pop(0):
> >>                 yield element
> >>     ...
> >>
> >> Now, what do we do if an exception does occur? We need to call
> >> iterclose on all of the remaining iterables, but the tricky bit is
> >> that this might itself raise new exceptions. If this happens, we don't
> >> want to abort early; instead, we want to continue until we've closed
> >> all the iterables, and then raise a chained exception. Basically what
> >> we want is:
> >>
> >> def chain(*iterables):
> >>     try:
> >>         while iterables:
> >>             for element in iterables.pop(0):
> >>                 yield element
> >>     finally:
> >>         try:
> >>             operators.iterclose(iter(iterables[0]))
> >>         finally:
> >>             try:
> >>                 operators.iterclose(iter(iterables[1]))
> >>             finally:
> >>                 try:
> >>                     operators.iterclose(iter(iterables[2]))
> >>                 finally:
> >>                     ...
> >>
> >> but of course that's not valid syntax. Fortunately, it's not too hard
> >> to rewrite that into real Python -- but it's a little dense:
> >>
> >> def chain(*iterables):
> >>     try:
> >>         while iterables:
> >>             for element in iterables.pop(0):
> >>                 yield element
> >>     # This is equivalent to the nested-finally chain above:
> >>     except BaseException as last_exc:
> >>         for iterable in iterables:
> >>             try:
> >>                 operators.iterclose(iter(iterable))
> >>             except BaseException as new_exc:
> >>                 if new_exc.__context__ is None:
> >>                     new_exc.__context__ = last_exc
> >>                 last_exc = new_exc
> >>         raise last_exc
> >>
> >> It's probably worth wrapping that bottom part into an iterclose_all()
> >> helper, since the pattern probably occurs in other cases as well.
> >> (Actually, now that I think about it, the map() example in the text
> >> should be doing this instead of what it's currently doing... I'll fix
> >> that.)
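For concreteness, here's a minimal sketch of what that iterclose_all()
helper might look like, lifted from the pattern above (the name comes from
Nathaniel's text; the exact signature is a guess, and operators.iterclose
is the proposal's hypothetical cleanup hook):

def iterclose_all(iterables, last_exc=None):
    # Close every iterable, even if some of the closes raise; chain
    # the exceptions together and re-raise at the end if any occurred.
    for iterable in iterables:
        try:
            operators.iterclose(iter(iterable))
        except BaseException as new_exc:
            if new_exc.__context__ is None:
                new_exc.__context__ = last_exc
            last_exc = new_exc
    if last_exc is not None:
        raise last_exc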
> >
> > At this point your code is starting to look a whole lot like the code
> > in contextlib.ExitStack.__exit__ :)
>
> One of the versions I tried but didn't include in my email used
> ExitStack :-). It turns out not to work here: the problem is that we
> effectively need to enter *all* the contexts before unwinding, even if
> trying to enter one of them fails. ExitStack is nested like (try (try
> (try ... finally) finally) finally), and we need (try finally (try
> finally (try finally ...))) But this is just a small side-point
> anyway, since most code is not implementing complicated
> meta-iterators; I'll address your real proposal below.
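A sketch of the difference, assuming a hypothetical list of context
managers: ExitStack only ever unwinds what it has already entered, so a
failure partway through entering means the rest are never touched:

from contextlib import ExitStack

# context_managers is a hypothetical list of already-constructed
# context managers that we want to enter as a group.
with ExitStack() as stack:
    for cm in context_managers:
        # If __enter__ fails on, say, the third cm, ExitStack unwinds
        # the two it already entered -- but the remaining cms are never
        # entered, and so never cleaned up. For iterator cleanup we
        # need the opposite: close *every* iterator no matter what.
        stack.enter_context(cm)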
>
> > Accordingly, I'm going to suggest that while I agree the problem you
> > describe is one that genuinely emerges in large production
> > applications and other complex systems, this particular solution is
> > simply far too intrusive to be accepted as a language change for
> > Python - you're talking a fundamental change to the meaning of
> > iteration for the sake of the relatively small portion of the
> > community that either work on such complex services, or insist on
> > writing their code as if it might become part of such a service, even
> > when it currently isn't. Given that simple applications vastly
> > outnumber complex ones, and always will, I think making such a change
> > would be a bad trade-off that didn't come close to justifying the
> > costs imposed on the rest of the ecosystem to adjust to it.
> >
> > A potentially more fruitful direction of research to pursue for 3.7
> > would be the notion of "frame local resources", where each Python
> > level execution frame implicitly provided a lazily instantiated
> > ExitStack instance (or an equivalent) for resource management.
> > Assuming that it offered an "enter_frame_context" function that mapped
> > to "contextlib.ExitStack.enter_context", such a system would let us do
> > things like:
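(Nick's example got trimmed in the quoting; it might have read something
like the following hypothetical sketch, where only the name
enter_frame_context comes from his proposal:)

def read_first_line(path):
    # Hypothetical API: the opened file would stay alive until this
    # function's frame exits, rather than needing a 'with' block.
    f = enter_frame_context(open(path))
    return next(f)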
>
> So basically a 'with expression', that gives up the block syntax --
> taking its scope from the current function instead -- in return for
> being usable in expression context? That's a really interesting idea,
> and I see the intuition that it might be less disruptive if our
> implicit iterclose calls are scoped to the function rather than the
> 'for' loop.
>
> But having thought about it and investigated some... I don't think
> function-scoping addresses my problem, and I don't see evidence that
> it's meaningfully less disruptive to existing code.
>
> First, "my problem":
>
> Obviously, Python's a language that should be usable for folks doing
> one-off scripts, and for paranoid folks trying to write robust complex
> systems, and for everyone in between -- these are all really important
> constituencies. And unfortunately, there is a trade-off here, where
> the changes we're discussing affect these constituencies differently.
> But it's not just a matter of shifting around a fixed amount of pain;
> the *quality* of the pain really changes under the different
> proposals.
>
> In the status quo:
> - for one-off scripts: you can just let the GC worry about generator
> and file handle cleanup, re-use iterators, whatever, it's cool
> - for robust systems: because it's the *caller's* responsibility to
> ensure that iterators are cleaned up, you... kinda can't really use
> generators without -- pick one -- (a) draconian style guides (like
> forbidding 'with' inside generators or forbidding bare 'for' loops
> entirely), (b) lots of auditing (every time you write a 'for' loop, go
> read the source to the generator you're iterating over -- no
> modularity for you and let's hope the answer doesn't change!), or (c)
> introducing really subtle bugs. Or all of the above. It's true that a
> lot of the time you can ignore this problem and get away with it one
> way or another, but if you're trying to write robust code then this
> doesn't really help -- it's like saying the footgun only has 1 bullet
> in the chamber. Not as reassuring as you'd think. It's like if every
> time you called a function, you had to explicitly say whether you
> wanted exception handling to be enabled inside that function, and if
> you forgot then the interpreter might just skip the 'finally' blocks
> while unwinding. There just *isn't* a good solution available.
>
> In my proposal (for-scoped-iterclose):
> - for robust systems: life is great -- you're still stopping to think
> a little about cleanup every time you use an iterator (because that's
> what it means to write robust code!), but since the iterators now know
> when they need cleanup and regular 'for' loops know how to invoke it,
> then 99% of the time (i.e., whenever you don't intend to re-use an
> iterator) you can be confident that just writing 'for' will do exactly
> the right thing, and the other 1% of the time (when you do want to
> re-use an iterator), you already *know* you're doing something clever.
> So the cognitive overhead on each for-loop is really low.
> - for one-off scripts: ~99% of the time (actual measurement, see
> below) everything just works, if anything a little bit better. 1% of
> the time, you deploy the clever trick of re-using an iterator with
> multiple for loops, and it breaks, so this is some pain. Here's what
> you see:
>
> gen_obj = ...
> for first_line in gen_obj:
>     break
> for lines in gen_obj:
>     ...
>
> Traceback (most recent call last):
>   File "/tmp/foo.py", line 5, in <module>
>     for lines in gen_obj:
> AlreadyClosedIteratorError: this iterator was already closed,
> possibly by a previous 'for' loop. (Maybe you want
> itertools.preserve?)
>
> (We could even have a PYTHONDEBUG flag that when enabled makes that
> error message include the file:line of the previous 'for' loop that
> called __iterclose__.)
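(And for the 1% who really do mean to re-use the iterator,
itertools.preserve -- the name floated in the proposal -- would
presumably be used like this; a sketch, assuming preserve simply shields
the iterator from __iterclose__:)

gen_obj = ...
for first_line in itertools.preserve(gen_obj):
    break                    # gen_obj is *not* closed by this loop
for line in gen_obj:         # so resuming iteration here is fine
    ...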
>
> So this is pain! But the pain is (a) rare, not pervasive, (b)
> immediately obvious (an exception, the code doesn't work at all), not
> subtle and delayed, (c) easily googleable, (d) easy to fix and the fix
> is reliable. It's a totally different type of pain than the pain that
> we currently impose on folks who want to write robust code.
>
> Now compare to the new proposal (function-scoped-iterclose):
>
> - For those who want robust cleanup: Usually, I only need an iterator
> for as long as I'm iterating over it; that may or may not correspond
> to the end of the function (often won't). When these don't coincide,
> it can cause problems. E.g., consider the original example from my
> proposal:
>
> def read_newline_separated_json(path):
>     with open(path) as f:
>         for line in f:
>             yield json.loads(line)
>
> but now suppose that I'm a Data Scientist (tm), so instead of having 1
> file full of newline-separated JSON, I have 100 gigabytes of the
> stuff stored in lots of files in a directory tree. Well, that's no
> problem, I'll just wrap that generator:
>
> def read_newline_separated_json_tree(tree):
>     for root, _, paths in os.walk(tree):
>         for path in paths:
>             for document in read_newline_separated_json(join(root, path)):
>                 yield document
>
>
> And then I'll run it on PyPy, because that's what you do when you have
> 100 GB of string processing, and... it'll crash, because the call to
> read_newline_separated_json_tree ends up doing thousands of calls to
> read_newline_separated_json, but never cleans any of them up until
> the function exits, so eventually we run out of file descriptors.
>
I still don't understand why you can't write it like this:
def read_newline_separated_json_tree(tree):
    for root, _, paths in os.walk(tree):
        for path in paths:
            with read_newline_separated_json(join(root, path)) as iterable:
                yield from iterable
Zero extra lines. Works today. Does everything you want.
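(One wrinkle: a plain generator object is not a context manager, so with
read_newline_separated_json written as a generator, as above, you would
spell it with contextlib.closing, which exists in the stdlib today:)

import os
from contextlib import closing
from os.path import join

def read_newline_separated_json_tree(tree):
    for root, _, paths in os.walk(tree):
        for path in paths:
            # closing() calls the generator's .close() on exit, which in
            # turn runs the generator's own 'with open(...)' cleanup.
            with closing(read_newline_separated_json(join(root, path))) as iterable:
                yield from iterable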
>
> A similar situation arises in the main loop of something like an HTTP
> server:
>
> while True:
>     request = read_request(sock)
>     for response_chunk in application_handler(request):
>         send_response_chunk(sock)
>
Same thing:
while True:
    request = read_request(sock)
    with application_handler(request) as iterable:
        for response_chunk in iterable:
            send_response_chunk(sock)
I'll stop posting about this, but I don't see the motivation behind this
proposal except replacing one explicit context management line with a
hidden "line" of cognitive overhead. I think the solution is to stop
returning an iterable when you have state needing a cleanup. Instead,
return a context manager and force the caller to open it to get at the
iterable.
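To make that concrete, here's one way read_newline_separated_json could
be rewritten to hand back a context manager instead of a bare generator --
a sketch using contextlib.contextmanager, with the generator-expression
body being one choice among several:

import json
from contextlib import contextmanager

@contextmanager
def read_newline_separated_json(path):
    # The caller must enter the context to get at the iterable, so the
    # file's lifetime is always explicit at the call site.
    with open(path) as f:
        yield (json.loads(line) for line in f)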
Best,
Neil
>
> Here we'll accumulate arbitrary numbers of un-closed
> application_handler generators attached to the stack frame, which is
> no good at all. And this has the interesting failure mode that you'll
> probably miss it in testing, because most clients will only re-use a
> connection a small number of times.
> So what this means is that every time I write a for loop, I can't just
> do a quick "am I going to break out of the for-loop and then re-use
> this iterator?" check -- I have to stop and think about whether this
> for-loop is nested inside some other loop, etc. And, again, if I get
> it wrong, then it's a subtle bug that will bite me later. It's true
> that with the status quo, we need to wrap X% of for-loops in 'with'
> blocks, and with this proposal that number would drop to, I don't
> know, (X/5)% or something. But that's not the most important cost: the
> most important cost is the cognitive overhead of figuring out which
> for-loops need the special treatment, and in this proposal that
> checking is actually *more* complicated than the status quo.
>
> - For those who just want to write a quick script and not think about
> it: here's a script that does repeated partial for-loops over a
> generator object:
>
>
> https://github.com/python/cpython/blob/553a84c4c9d6476518e2319acda6ba29b8588cb4/Tools/scripts/gprof2html.py#L40-L79
>
> (and note that the generator object even has an ineffective 'with
> open(...)' block inside it!)
>
> With the function-scoped-iterclose, this script would continue to work
> as it does now. Excellent.
>
> But, suppose that I decide that that main() function is really
> complicated and that it would be better to refactor some of those
> loops out into helper functions. (Probably actually true in this
> example.) So I do that and... suddenly the code breaks. And in a
> rather confusing way, because it has to do with this complicated
> long-distance interaction between two different 'for' loops *and*
> where they're placed with respect to the original function versus the
> helper function.
>
> If I were an intermediate-level Python student (and I'm pretty sure
> anyone who is starting to get clever with re-using iterators counts as
> "intermediate level"), then I'm pretty sure I'd actually prefer the
> immediate, obvious feedback from the for-scoped-iterclose. This would
> actually be a good time to teach folks about this aspect of resource
> handling -- it's certainly an important thing to learn eventually on
> your way to Python mastery, even if it isn't needed for every script.
>
> In the pypy-dev thread about this proposal, there are some very
> distressed emails from someone who's been writing Python for a long
> time but only just realized that generator cleanup relies on the
> garbage collector:
>
> https://mail.python.org/pipermail/pypy-dev/2016-October/014709.html
> https://mail.python.org/pipermail/pypy-dev/2016-October/014720.html
>
> It's unpleasant to have the rug pulled out from under you like this
> and suddenly realize that you might have to go re-evaluate all the
> code you've ever written, and making for loops safe-by-default and
> fail-fast-when-unsafe avoids that.
>
> Anyway, in summary: function-scoped-iterclose doesn't seem to
> accomplish my goal of getting rid of the *type* of pain involved when
> you have to run a background thread in your brain that's doing
> constant paranoid checking every time you write a for loop. Instead it
> arguably takes that type of pain and spreads it around both the
> experts and the novices :-/.
>
> -------------
>
> Now, let's look at some evidence about how disruptive the two
> proposals are for real code:
>
> As mentioned else-thread, I wrote a stupid little CPython hack [1] to
> report when the same iterator object gets passed to multiple 'for'
> loops, and ran the CPython and Django testsuites with it [2]. Looking
> just at generator objects [3], across these two large codebases there
> are exactly 4 places where this happens. (Rough idea of prevalence:
> these 4 places together account for a total of 8 'for' loops; this is
> out of 11,503 'for' loops in total, of which 665 involve generator
> objects.) The 4 places are:
>
> 1) CPython's Lib/test/test_collections.py:1135,
> Lib/_collections_abc.py:378
>
> This appears to be a bug in the CPython test suite -- the little MySet
> class does 'def __init__(self, itr): self.contents = itr', which
> assumes that itr is a container that can be repeatedly iterated. But a
> bunch of the methods on collections.abc.Set like to pass in a
> generator object here instead, which breaks everything (see the
> sketch after this list). If repeated 'for' loops on generators raised
> an error then this bug would have been caught much sooner.
>
> 2) CPython's Tools/scripts/gprof2html.py lines 45, 54, 59, 75
>
> Discussed above -- as written, for-scoped-iterclose would break this
> script, but function-scoped-iterclose would not, so here
> function-scoped-iterclose wins.
>
> 3) Django django/utils/regex_helper.py:236
>
> This code is very similar to the previous example in its general
> outline, except that the 'for' loops *have* been factored out into
> utility functions. So in this case for-scoped-iterclose and
> function-scoped-iterclose are equally disruptive.
>
> 4) CPython's Lib/test/test_generators.py:723
>
> I have to admit I cannot figure out what this code is doing, besides
> showing off :-). But the different 'for' loops are in different stack
> frames, so I'm pretty sure that for-scoped-iterclose and
> function-scoped-iterclose would be equally disruptive.
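Regarding that first bug, here's a sketch of the failure mode, with MySet
reduced to a minimal stand-in for the test-suite class:

class MySet:
    def __init__(self, itr):
        self.contents = itr        # assumes itr can be iterated repeatedly
    def __iter__(self):
        return iter(self.contents)

s = MySet(x for x in range(3))     # but a generator object is passed in
print(list(s))                     # [0, 1, 2]
print(list(s))                     # [] -- silently wrong the second time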
>
> Obviously there's a bias here in that these are still relatively
> "serious" libraries; I don't have a big corpus of one-off scripts that
> are just a big __main__, though gprof2html.py isn't far from that. (If
> anyone knows where to find such a thing let me know...) But still, the
> tally here is that out of 4 examples, we have 1 subtle bug that
> iterclose might have caught, 2 cases where for-scoped-iterclose and
> function-scoped-iterclose are equally disruptive, and only 1 where
> function-scoped-iterclose is less disruptive -- and in that case it's
> arguably just avoiding an obvious error now in favor of a more
> confusing error later.
>
> If this reduced the backwards-incompatible cases by a factor of, like,
> 10x or 100x, then that would be a pretty strong argument in its favor.
> But it seems to be more like... 1.5x.
>
> -n
>
> [1]
> https://github.com/njsmith/cpython/commit/2b9d60e1c1b89f0f1ac30cbf0a5dceee835142c2
> [2] CPython: revision b0a272709b from the github mirror; Django:
> revision 90c3b11e87
> [3] I also looked at "all iterators" and "all iterators with .close
> methods", but this email is long enough... basically the pattern is
> the same: there are another 13 'for' loops that involve repeated
> iteration over non-generator objects, and they're roughly equally
> split between spurious effects due to bugs in the CPython test-suite
> or my instrumentation, cases where for-scoped-iterclose and
> function-scoped-iterclose both cause the same problems, and cases
> where function-scoped-iterclose is less disruptive.
>
> -n
>
> --
> Nathaniel J. Smith -- https://vorpus.org
> _______________________________________________
> Python-ideas mailing list
> Python... at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
>