[Python-Dev] Re: The iterator story

Mon, 22 Jul 2002 21:57:24 -0700 (PDT)

SYNOPSIS: a slight adjustment to the definition of consume()
yields a simple solution that addresses both the destruction
issue and the multiple-iteration issue, without introducing
any new syntax.

On Mon, 22 Jul 2002, Greg Ewing wrote:
> As someone pointed out, it's pretty rare that you actually *want* to
> consume the sequence. Usually the choice is between "I don't care" and
> "The sequence must NOT be consumed".

Sure, i'll go for that.  What i'm after is the ability to say
"i would like this sequence not to be consumed."

> Of the two varieties of for-loop in your proposal, for-in
> obviously corresponds to the "must not be consumed" case,
> leading one to suppose that you intend for-from to be used in
> the don't-care case.

Right.

> But now you seem to be suggesting that library routines
> should always use for-in, and that the caller should
> convert an iterator to a sequence if he knows it's okay
> to consume it:

The two are semantically equivalent proposals.  I explained
them both in the original message that i posted proposing
the solution.  The 'consume()' library routine is just another
way to express 'for-from' without using new syntax.

However, it is true that 'consume()' is more generally useful.
It would be good to have, whether or not we had new syntax.
I acknowledge that i did not realize this at the time i wrote
the earlier message, or i would have stated the 'consume()'
(then called 'seq()') proposal first and the for-from proposal
second, instead of the opposite.

That is why i am sticking to talking about the no-new-syntax
version of the proposal for now.  I apologize if it seems
that i am asking you to follow a moving target.  I would like
you to recognize, though, that the underlying concept is the
same -- the programmer has to signal when an iterator is being
used like a sequence.

> Okay, that seems reasonable -- explicit is better than
> implicit. But... consider the following two library
> routines:
>
>   def printout1(s):
>     for x in s:
>       print x
>
>   def printout2(s):
>     for x in s:
>       for y in s:
>         print x, y
[...]
> no exception will be raised if you call printout2(consume(s))
> by mistake.

Good point!  Clearly my proposal did not take care of this case.
(But there are solutions below; read on.)

Upon some reflection, though, it seems to me that this problem
is orthogonal to the proposal: forcing the programmer to declare
when destruction is allowed neither solves nor exacerbates the
problem of printout2().  consume() is about destruction, whereas
printout2() is about multiple iteration.
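
A minimal sketch of the multiple-iteration problem, written in Python 3
syntax (an assumption; the quoted code uses the older print statement).
The helper name pairs() is hypothetical: it has the same loop shape as
printout2() but collects the pairs instead of printing them.

```python
def pairs(s):
    """Same nested loops as printout2(), but returns the pairs."""
    result = []
    for x in s:
        for y in s:
            result.append((x, y))
    return result

# A list hands out a fresh iterator to each loop, so all pairs appear.
as_list = pairs([1, 2, 3])

# An iterator is shared by both loops: the inner loop drains it, and the
# outer loop then stops quietly with no error.
as_iter = pairs(iter([1, 2, 3]))

print(len(as_list))   # 9
print(as_iter)        # [(1, 2), (1, 3)]
```

Nothing here involves declaring that destruction is allowed; the wrong
answer comes purely from iterating twice over something that cannot be
restarted.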

> To get any safety benefit from your proposed arrangement,
> it seems to me that you'd need to write printout1 as
>
>   def printout1(s):
>     "s must be an iterator"
>     for x from s:
>       print x

I'm afraid i don't see how this bears on the problem you just
described.  It still would not be possible to write a safe version
of printout2() in either (a) the world of the current Python with
iterators or (b) a world where for-in does not accept iterators
and consume() has been introduced.

One real solution to this problem is what Oren has been suggesting
all along -- raise an IteratorExhausted exception if you try to fetch
an element from an iterator that has already thrown StopIteration.
In printout2(), this exception would occur as soon as the outer loop
tries to fetch its second element, just after the inner loop has run the
iterator dry.  This works, but we can do even better.
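
A rough sketch of that first solution, in Python 3 syntax.  Both the
IteratorExhausted class and the ExhaustionChecked wrapper are
hypothetical names defined here for illustration; neither exists in
Python, and this is only one way the idea might be realized.

```python
class IteratorExhausted(Exception):
    """Hypothetical exception; not part of Python."""


class ExhaustionChecked:
    """Wrap an iterable; complain on any fetch after StopIteration."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._done = False      # the extra state the iterator must keep

    def __iter__(self):
        return self

    def __next__(self):
        if self._done:
            raise IteratorExhausted("StopIteration was already raised")
        try:
            return next(self._it)
        except StopIteration:
            self._done = True
            raise


it = ExhaustionChecked([1, 2, 3])
first_pass = list(it)           # the normal end of iteration is still fine
try:
    next(it)                    # one fetch too many
    complained = False
except IteratorExhausted:
    complained = True
```

With a wrapper like this, a nested loop over a shared iterator fails
loudly instead of ending quietly, at the cost of one flag per iterator.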

After some thought today, i realized that there is a second solution.
Thanks for leading me to it, Greg!  With consume(), the programmer
has declared that the iterator is okay to destroy.  But my definition
of consume() was incomplete.  One slight change solves the problem:

    consume(y) returns x such that iter(x) returns y the
    first time, and raises IteratorConsumedException thereafter.

Now we're all set!  If consume(it) is passed to printout2(), an
exception is raised immediately, before any damage is done.  This
detects whether you attempt to *start* the iterator twice, which
makes more sense than detecting whether you hit the *end* of the
iterator twice.

The insight is that protection against multiple iteration belongs
in the implementation of __iter__, not in the iterator itself --
because the iterator doesn't know whether it can be restarted.
The *provider* of the iterator does.
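
A minimal sketch of the adjusted consume(), in Python 3 syntax.  The
proposal describes consume() as a built-in routine; writing it as a
small class, and defining IteratorConsumedException here, are
assumptions made so the example can run on its own.

```python
class IteratorConsumedException(Exception):
    """Exception named in the proposal; defined here, not part of Python."""


class consume:
    """consume(y) returns x such that iter(x) returns y exactly once."""

    def __init__(self, iterator):
        self._iterator = iterator
        self._started = False

    def __iter__(self):
        # The check lives in __iter__, so it fires when a second
        # iteration *starts*, not when the end is reached twice.
        if self._started:
            raise IteratorConsumedException("iteration was already started")
        self._started = True
        return self._iterator


source = iter([1, 2, 3])
wrapped = consume(source)
seen = []
try:
    for x in wrapped:
        for y in wrapped:       # second iter() call: raises here
            seen.append((x, y))
    raised = False
except IteratorConsumedException:
    raised = True
```

In this run the exception arrives when the inner loop starts, so no
pair is ever produced; only the single element fetched by the outer
loop has been taken from the underlying iterator.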

> There's no doubt that it's very elegant theoretically,
> but in thinking through the implications, I'm not sure it
> would be all that helpful in practice, and might even
> turn out to be a nuisance if it requires putting in a
> lot of iter(x) and/or consume(x) calls.

It's not so bad.  You only have to say iter() or consume() in
exceptional cases, where you are specifically writing code to
manipulate iterators.  Everything else looks the same -- except
it's safe.

More importantly, neither iter() nor consume() needs to be taught
on the first day of Python.

I think it all comes together quite nicely.  Here it is in summary:

    - Iterators just implement __next__.

    - Containers, and other things that want to be iterated over,
      just implement __iter__.

    - The new built-in routine consume(y) returns x such that iter(x)
      returns y the first time, and raises IteratorConsumedException
      thereafter.

    - (Other objects that only allow one-shot iteration can also raise
      IteratorConsumedException when their __iter__ is called twice.)
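
The split between the first two items can be sketched as follows, in
Python 3 syntax, where the iterator method is already spelled __next__.
Countdown and CountdownIterator are hypothetical names.  Python 3
happens to behave the way the summary describes for such classes: an
object that lacks __iter__ is rejected by "for" and by anything else
that calls iter().

```python
class CountdownIterator:
    """An iterator: implements only __next__, with no dummy __iter__."""

    def __init__(self, start):
        self._n = start

    def __next__(self):
        if self._n <= 0:
            raise StopIteration
        value = self._n
        self._n -= 1
        return value


class Countdown:
    """A container-like object: implements only __iter__."""

    def __init__(self, start):
        self._start = start

    def __iter__(self):
        # A fresh iterator on every call, so repeated loops are safe.
        return CountdownIterator(self._start)


c = Countdown(3)
first = list(c)
second = list(c)                # restartable: same result again

try:
    list(CountdownIterator(3))  # a bare iterator is not accepted
    rejected = False
except TypeError:
    rejected = True
```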

Advantages:

    1. "for-in" and "in" are safe to use -- no fear of destruction.

    2. One-shot iterators are safe against multiple iteration.

    3. Iterators don't have to implement a dummy __iter__ method
       returning self.

    4. The implementation of "for" stays exactly as it is now.

    5. Current implementations of iterators continue to work fine,
       if unsafely (but they're already unsafe).

    6. No new syntax.

    7. For-loops continue to work on containers exactly as they
       always have.

    8. Iterators don't have to maintain extra state to know that
       it's time to start throwing IteratorExhausted instead of
       StopIteration.

Items 1, 2, and 3 are distinct improvements over the current state
of affairs.  The only inconvenience is the case where an iterator
is being passed to a routine that expects a container; this is
still pretty rare, and this situation is easy to detect (hence,
the error message from "for" can explain what to do).  In this case,
you have to wrap consume() around the iterator to declare it okay
to consume.  And that's all.

The fact that it takes only a slight adjustment to the earlier proposal
to solve *both* the destruction problem and the multiple-iteration
problem has led me to be even more convinced that this is the "right
answer" -- in the sense that this is how i would design the protocol
if we were starting from scratch.

Now, i know we are not starting from scratch.  And i know Guido has
already said he doesn't want to solve this problem.  But, just in
case you are wondering, the migration path from here to there seems
pretty straightforward to me:

    1. When __next__() is not present, call next() and issue a warning.

    2. In the next version, deprecate next() in favour of __next__().

    3. Add consume() and IteratorConsumedException to built-ins.

    4. Deprecate the dummy __iter__() method on iterators.

    5. Throw a party and consume(mass_quantities).
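
Step 1 of the migration could look roughly like the sketch below.  In
the interpreter this logic would sit inside the implementation of
"for"; the helper fetch_next() and the class OldStyleIterator are
hypothetical names used only to show the fallback, and the choice of
DeprecationWarning is an assumption.

```python
import warnings


def fetch_next(iterator):
    """Prefer __next__(); fall back to next() and issue a warning."""
    method = getattr(type(iterator), "__next__", None)
    if method is not None:
        return method(iterator)
    warnings.warn(
        "next() is deprecated; define __next__() instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return iterator.next()


class OldStyleIterator:
    """Implements only the old-style next() method."""

    def __init__(self, items):
        self._items = list(items)

    def next(self):
        if not self._items:
            raise StopIteration
        return self._items.pop(0)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_value = fetch_next(OldStyleIterator(["a", "b"]))   # warns
    new_value = fetch_next(iter(["x", "y"]))               # no warning
```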

-- ?!ng

"Most things are, in fact, slippery slopes.  And if you start backing off
from one thing because it's a slippery slope, who knows where you'll stop?"
    -- Sean M. Burke