[Python-Dev] The iterator story

Guido van Rossum guido@python.org
Fri, 19 Jul 2002 17:10:45 -0400


> Here is a summary of the whole iterator picture as i currently see it.
> This is necessarily subjective, but i will try to be precise so that
> it's clear where i'm making a value judgement and where i'm trying to
> state fact, and so we can pinpoint areas where we agree and disagree.
> 
> In the subjective sections, i have marked with [@] the places where
> i solicit agreement or disagreement.
> 
> I would like to know your opinions on the issues listed below,
> and on the places marked [@].
> 
> 
> Definitions (objective)
> -----------------------
> 
> Container: a thing that provides non-destructive access to a varying
> number of other things.
> 
>     Why "non-destructive"?  Because i don't expect that merely looking
>     at the contents will cause a container to be altered.  For example,
>     i expect to be able to look inside a container, see that there are
>     five elements; leave it alone for a while, come back to it later
>     and observe once again that there are five elements.
> 
>     Consequently, a file object is not a container in general.  Given
>     a file object, you cannot look at it to see if it contains an "A",
>     and then later look at it once again to see if it contains an "A"
>     and get the same result.  If you could seek, then you could do
>     this, but not all files support seeking.  Even if you could seek,
>     the act of reading the file would still alter the file object.
> 
>     The file object provides no way of getting at the contents without
>     mutating itself.  According to my definition, it's fine for a
>     container to have ways of mutating itself; but there has to be
>     *some* way of getting the contents without mutating the container,
>     or it just ain't a container to me.
> 
>     A file object is better described as a stream.  Hypothetically
>     one could create an interface to seekable files that offered some
>     non-mutating read operations; this would cause the file to look
>     more like an array of bytes, and i would find it appropriate to
>     call that interface a container.
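
A minimal sketch of the distinction in Python (the file name is
hypothetical): a list can be inspected over and over without changing,
while a file object only gives up its contents by advancing its own
position.

    box = [1, 2, 3]
    print(len(box), box)    # inspect the container...
    print(len(box), box)    # ...and inspect it again: nothing has changed

    f = open("data.txt")    # hypothetical file
    first = f.read(1)       # merely "looking" advances the file position
    again = f.read(1)       # a different byte comes back: the object mutated
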
> 
> Iterator: a thing that you can poke (i.e. send a no-argument message),
> where each time you poke it, it either yields something or announces
> that it is exhausted.
> 
>     For an iterator to mutate itself every time you poke it is not
>     part of my definition.  But the only non-mutating iterator would
>     be an iterator that returns the same thing forever, or an iterator
>     that is always exhausted.  So most iterators usually mutate.
> 
>     Some iterators are associated with a container, but not all.
> 
>     There can be many kinds of iterators associated with a container.
>     The most natural kind is one that yields the elements of the
>     container, one by one, mutating itself each time it is poked,
>     until it has yielded all of the elements of the container and
>     announces exhaustion.
> 
> A Container's Natural Iterator: an iterator that yields the elements
> of the container, one by one, in the order that makes the most sense
> for the container.  If the container has a finite size n, then the
> iterator can be poked exactly n times, and thereafter it is exhausted.
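
In present-day spelling (where the poke is a call to the iterator's
__next__ method, named next() in the protocol under discussion, and
exhaustion is announced by raising StopIteration), the natural iterator
of a three-element list behaves like this:

    it = iter(['a', 'b', 'c'])   # the list's natural iterator
    print(next(it))              # 'a'
    print(next(it))              # 'b'
    print(next(it))              # 'c'  -- poked n times...
    next(it)                     # ...and now it raises StopIteration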

Sure.

But I note that there are hybrids, and I think files (at least
seekable files) fall in the hybrid category.  Other examples of
hybrids:

- Some dbm variants (e.g. dbhash and gdbm) provide first() and next()
  or firstkey() and nextkey() methods that combine iterator state with
  the container object.  These objects simply provide two different
  interfaces, a containerish interface (__getitem__ in fact), and an
  iteratorish interface.

- Before we invented the concept of iterators, I believe it was
  common for tree data structures to provide iterators that didn't put
  the iteration state in a separate object, but simply kept a pointer
  to the current node of the iteration pass somewhere in the root of
  the tree.

The idea that a container also has some iterator state, and that you
have to do something simple (like calling firstkey() or seek(0)) to
reset the iterator, is quite common.

You may argue that this is poor design that should be fixed, and in
general I would agree (the firstkey()/nextkey() protocol in particular
is clumsy to use), but it is common nevertheless, and sometimes common
usage patterns as well as the relative cost of random access make it a
good compromise.  For example, while a tape file is a
container in the sense that reading the data doesn't destroy it, it's
very heavily geared towards sequential access, and you can't
realistically have two iterators going over the same tape at once.  If
you're too young to remember, think of files on CD media -- there,
random access, while possible, is several orders of magnitude slower
than sequential access (better than tape, but a lot worse than regular
magnetic hard drives).
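
A hypothetical sketch of that hybrid pattern, loosely modelled on the
firstkey()/nextkey() style (simplified so that nextkey() takes no
argument): the iteration cursor lives in the container itself, and
resetting it is just another method call.

    class HybridTable:
        def __init__(self, data):
            self._data = dict(data)
            self._keys = []             # iteration state kept in the container
            self._pos = 0

        def __getitem__(self, key):     # the containerish interface
            return self._data[key]

        def firstkey(self):             # the iteratorish interface: reset, start
            self._keys = sorted(self._data)
            self._pos = 0
            return self.nextkey()

        def nextkey(self):              # advance the single shared cursor
            if self._pos >= len(self._keys):
                return None
            key = self._keys[self._pos]
            self._pos += 1
            return key

    t = HybridTable({"a": 1, "b": 2})
    k = t.firstkey()
    while k is not None:                # only one traversal can be active
        print(k, t[k])
        k = t.nextkey()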

> Issues (objective)
> ------------------
> 
> I alluded to a set of issues in an earlier message, and i'll begin
> there, by defining what i meant more precisely.
> 
> The Destructive-For Issue:
> 
>     In most languages i can think of, and in Python for the most
>     part, a statement such as "for x in y: print x" is a
>     non-destructive operation on y.  Repeating "for x in y: print x"
>     will produce exactly the same results once more.
> 
>     For pre-iterator versions of Python, this fails to be true only
>     if y's __getitem__ method mutates y.  The introduction of
>     iterators has caused this to now be untrue when y is any iterator.
> 
>     The issue is, should "for" be non-destructive?

I don't see the benefit.  We've done this for years and the only
conceptual problem was the abuse of __getitem__, not the
destructiveness of the for-loop.

> The Destructive-In Issue:
> 
>     Notice that the iteration that takes place for the "in" operator
>     is implemented in the same way as "for".  So if "for" destroys
>     its second operand, so will "in".
> 
>     The issue is, should "in" be non-destructive?

If it can't be helped otherwise, sure, why not?

>     (Similar issues exist for built-ins that iterate, like list().)

At least list() keeps a copy of all the items, so you can then iterate
over them as often as you want. :-)
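
A quick illustration of the behaviour at issue, in present-day
spelling: "for", "in" and list() all drain an iterator, though list()
at least hands back a copy that can be looped over as often as you
like.

    it = iter([1, 2, 3])
    print(2 in it)          # True  -- but 1 and 2 have now been consumed
    print(list(it))         # [3]   -- only what was left
    print(list(it))         # []    -- the iterator is exhausted

    items = list(iter([1, 2, 3]))   # list() materialises the elements...
    for x in items:                 # ...so this loop can be repeated at will
        print(x)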

> The __iter__-On-Iterators Issue:
> 
>     Some people have mentioned that the presence of an __iter__()
>     method is a way of signifying that an object supports the
>     iterator protocol.  It has been said that this is necessary
>     because the presence of a "next()" method is not sufficiently
>     distinguishing.

Not me.

>     Some have said that __iter__() is a completely distinct protocol
>     from the iterator protocol.
> 
>     The issue is, what is __iter__() really for?

To support iter() and for-loops.

>     And secondarily, if it is not part of the iterator protocol,
>     then should we require __iter__() on iterators, and why?

So that you can use an iterator in a for-loop.
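
That is, an iterator whose __iter__ simply returns self can be handed
straight to a for-loop, because the loop calls iter() on its operand.
A minimal sketch, using the later __next__ spelling for the method
called next() in this discussion:

    class CountDown:
        def __init__(self, n):
            self.n = n
        def __iter__(self):         # lets a for-loop accept the iterator itself
            return self
        def __next__(self):         # spelled next() under the protocol at hand
            if self.n <= 0:
                raise StopIteration
            self.n -= 1
            return self.n + 1

    for i in CountDown(3):
        print(i)                    # 3, 2, 1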

> The __next__-Naming Issue:
> 
>     The iteration method is currently called "next()".
> 
>     Previous candidates for the name of this method were "next",
>     "__next__", and "__call__".  After some previous debate,
>     it was pronounced to be "next()".
> 
>     There are concerns that "next()" might collide with existing
>     methods named "next()".  There is also a concern that "next()"
>     is inconsistent because it is the only type-slot-method that
>     does not have a __special__ name.
> 
>     The issue is, should it be called "next" or "__next__"?

That's a separate issue, and cleans up only a small wart that in
practice hasn't hurt anybody AFAIK.


> My Positions (subjective)
> -------------------------
> 
> I believe that "for" and "in" and list() should be non-destructive.
> I believe that __iter__() should not be required on iterators.
> I believe that __next__() is a better name than next().
> 
> Destructive-For, Destructive-In:
> 
>     I think "for" should be non-destructive because that's the way
>     it has almost always behaved, and that's the way it behaves in
>     any other language [@] i can think of.
> 
>     For a container's __getitem__ method to mutate the container is,
>     in my opinion, bad behaviour.  In pre-iterator Python, we needed
>     some way to allow the convenience of "for" on user-implemented
>     containers.  So "for" supported a special protocol where it would
>     call __getitem__ with increasing integers starting from 0 until
>     it hit an IndexError.  This protocol works great for sequence-like
>     containers that are indexable by integers.
> 
>     But other containers had to be hacked somewhat to make them fit.
>     For example, there was no good way to do "for" over a dictionary-like
>     container.  If you attempted "for" over a user-implemented dictionary,
>     you got a really weird "KeyError: 0", which only made sense if you
>     understood that the "for" loop was attempting __getitem__(0).
> 
>     (Hey!  I just noticed that
> 
>         from UserDict import UserDict
>         for k in UserDict(): print k
> 
>     still produces "KeyError: 0"!  This oughta be fixed...)

Check the CVS logs.  At one point before 2.2 was released, UserDict
had an __iter__ method.  But then SF bug 448153 was filed, presenting
evidence that this broke previously working code.  So a separate
class, IterableUserDict, was added that has the __iter__ method.

I agree that this is less than ideal, but that's life.
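
For reference, a small sketch of the pre-iterator protocol described
above: "for" calls __getitem__ with 0, 1, 2, ... and stops at
IndexError, which is exactly why a dictionary-like __getitem__ with no
integer keys surfaces as "KeyError: 0".

    class Squares:                      # sequence-like: the old protocol fits
        def __init__(self, n):
            self.n = n
        def __getitem__(self, i):
            if i >= self.n:
                raise IndexError(i)     # the for-loop's stop signal
            return i * i

    print(list(Squares(4)))             # [0, 1, 4, 9]

    class MappingLike:                  # dictionary-like: the old protocol misfires
        def __getitem__(self, key):
            return {"a": 1}[key]

    # for k in MappingLike(): ...       # raises KeyError: 0, as described above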

>     If you wanted to support "for" on something else, sometimes you
>     would have to make __getitem__ mutate the object, like it does
>     in the fileinput module.  But then the user has to know that
>     this object is a special case: "for" only works the first time.

This was and still is widespread.  There are a lot of objects that
have a way to return an iterator (old style using a fake __getitem__,
new style using __iter__ and next) that is intended to be looped
over once.  I have no desire to deprecate this behavior, since (a) it
would be a major upheaval for the user community (a lot worse than
integer division), and (b) I don't see that "fixing" this prevents a
particular category of programming errors.

>     When iterators were introduced, i believed they were supposed
>     to solve this problem.  Currently, they don't.

No, they solve the conceptual ugliness of providing a __getitem__ that
can only be called once.  The new rule is, if you provide __getitem__,
it must support random access; otherwise, you should provide __iter__.

>     Currently, "in" can even be destructive.  This is more serious.
>     While one could argue that it's not so strange for
> 
>         for x in y: ...
> 
>     to alter y (even though i do think it is strange), i believe
>     just about anyone would find it very counterintuitive for
> 
>         if x in y:
> 
>     to alter y.  [@]

That falls in the category of "then don't do that".

> __iter__-On-Iterators:
> 
>     I believe __iter__ is not a type flag.  As i argued previously,
>     i think that looking for the presence of methods that don't actually
>     implement a protocol is a poor way to check for protocol support.
>     And as things stand, the presence of __iter__ doesn't even work [@]
>     as a type flag.

And I never said it was a type flag.  I'm tired of repeating myself,
but you keep repeating this broken argument, so I have to keep
correcting you.

>     There are objects with __iter__ that are not iterators (like most
>     containers).  And there are objects without __iter__ that work as
>     iterators.  I know you can legislate the latter away, but i think
>     such legislation would amount to fighting the programmers -- and
>     it is infeasible [@] to enforce the presence of __iter__ in practice.

I think having next without having __iter__ is like having __getitem__
without having __len__.  There are corner cases where you might get
away with this because you know it won't be called, but (as I've
repeated umpteen times now), a for-loop over an iterator is a common
idiom.

>     Based on Guido's positive response, in which he asked me to make
>     an addition to the PEP, i believe Guido agrees with me that
>     __iter__ is distinct from the protocol of an iterator.  This
>     surprised me because it runs counter to the philosophy previously
>     expressed in the PEP.

I recognize that they are separate protocols.  But because I like the
for-loop as a convenient way to get all of the elements of an
iterator, I want iterators to support __iter__.

The alternative would be for iter() to see if the object implements
next (after finding that it has neither __iter__ nor __getitem__), and
return the object itself unchanged.  If we had picked __next__ instead
of 'next', that would perhaps have been my choice (though I might *still*
have recommended implementing __iter__ returning self, to avoid two
failing getattr calls).
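
A hypothetical pure-Python sketch of that alternative: iter() handing
the object back unchanged when it looks like an iterator, after the
__iter__ and __getitem__ checks have failed (the wrapping of the old
__getitem__ protocol is left out here).

    def iter_fallback(obj):
        if hasattr(obj, "__iter__"):
            return obj.__iter__()
        if hasattr(obj, "__getitem__"):
            raise NotImplementedError("wrap the old __getitem__ protocol here")
        if hasattr(obj, "__next__"):    # spelled 'next' in the protocol as it stands
            return obj                  # already an iterator: hand it back as-is
        raise TypeError("object is not iterable")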

>     Now suppose we agree that __iter__ and next are distinct protocols.
>     Then why require iterators to support both?  The only reason we
>     would want __iter__ on iterators is so that we can use "for" [@]
>     with an iterator as the second operand.

Right.  Finally you got it.

>     I have just argued, above, that it's *not* a good idea for "for"
>     and "in" to be destructive.  Since most iterators self-mutate,
>     it follows that it's not advisable to use an iterator directly
>     as the second operand of a "for" or "in".
> 
>     I realize this seems radical!  This may be the most controversial
>     point i have made.  But if you accept that "in" should not
>     destroy its second argument, the conclusion is unavoidable.

Since I have little sympathy for your premise, this conclusion is far
from unavoidable for me. :-)

> __next__-Naming:
> 
>     I think the potential for collision, though small, is significant,
>     and this makes "__next__" a better choice than "next".  A built-in
>     function next() should be introduced; this function would call the
>     tp_iternext slot, and for instance objects tp_iternext would call
>     the __next__ method implemented in Python.
> 
>     The connection between this issue and the __iter__ issue is that,
>     if next() were renamed to __next__(), the argument that __iter__
>     is needed as a flag would also go away.

I really wish we had had this insight 18 months ago.  Right now, it's
too late.  Dragging all the other stuff in doesn't strengthen the
argument for fixing it now.
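
At the Python level, the proposed built-in would amount to little more
than this sketch (the real thing would go through the tp_iternext
slot, as the quoted text says):

    def next(iterator):
        # Poke the iterator by calling its renamed method.
        return iterator.__next__()

    # next(iter("abc"))  ->  'a'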

> The Current PEP (objective)
> ---------------------------
> 
> The current PEP takes the position that "for" and "in" can be
> destructive; that __iter__() and next() represent two distinct
> protocols, yet iterators are required to support both; and that
> the name of the method on iterators is called "next()".
> 
> 
> My Ideal Protocol (subjective)
> ------------------------------
> 
> So by now the biggest question/objection you probably have is
> "if i can't use an iterator with 'for', then how can i use it?"
> 
> The answer is that "for" is a great way to iterate over things;
> it's just that it iterates over containers and i want to preserve
> that.  We need a different way to iterate over iterators.
> 
> In my ideal world, we would allow a new form of "for", such as
> 
>     for line from file:
>         print line
> 
> The use of "from" instead of "in" would imply that we were
> (destructively) pulling things out of the iterator, and would
> remove any possible parallel to the test "x in y", which should
> rightly remain non-destructive.

Alternative syntaxes for for-loops have been proposed as solutions to
all sorts of things (e.g. what's called enumerate() in 2.3, and a
simplified syntax for range(), and probably other things).

I'm not keen on this.  I don't want to user-test it, but I expect that
it's too subtle a difference, and that we would see Aha! experiences
of the kind "Oh, it's a for-*from* loop!  I never noticed that, I
always read it as a for-*in* loop!  That explains the broken behavior."

> Here's the whole deal:
> 
>     - Iterators provide just one method, __next__().
> 
>     - The built-in next() calls tp_iternext.  For instances,
>       tp_iternext calls __next__.
> 
>     - Objects wanting to be iterated over provide just one method,
>       __iter__().  Some of these are containers, but not all.
> 
>     - The built-in iter(foo) calls tp_iter.  For instances,
>       tp_iter calls __iter__.
> 
>     - "for x in y" gets iter(y) and uses it as an iterator.
> 
>     - "for x from y" just uses y as the iterator.
> 
> That's it.
> 
> Benefits:
> 
>     - We have a nice clean division between containers and iterators.
> 
>     - When you see "for x in y" you know that y is a container.
> 
>     - When you see "for x from y" you know that y is an iterator.
> 
>     - "for x in y" never destroys y.
> 
>     - "if x in y" never destroys y.
> 
>     - If you have an object that is container-like, you can add
>       an __iter__ method that gives its natural iterator.  If
>       you want, you can supply more iterators that do different
>       things; no problem.  No one using your object is confused
>       about whether it mutates.
> 
>     - If you have an object that is cursor-like or stream-like,
>       you can safely make it into an iterator by adding __next__.
>       No one using your object is confused about whether it mutates.
> 
> Other notes:
> 
>     - Iterator algebra still works fine, and is still easy to write:
> 
>         def alternate(it):
>             while 1:
>                 yield next(it)
>                 next(it)
> 
>     - The file problem has a consistent solution.  Instead of writing
>       "for line in file" you write
> 
>         for line from file:
>             print line
> 
>       Being forced to write "from" signals to you that the file is
>       eaten up.  There is no expectation that "for line from file"
>       will work again.
> 
>       The best would be a convenience function "readlines", to
>       make this even clearer:
> 
>         for line in readlines("foo.txt"):
>             print line
> 
>       Now you can do this as many times as you want, and there is
>       no possibility of confusion; there is no file object on which
>       to call methods that might mess up the reading of lines.
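
One way to sketch that convenience in today's Python is a small
wrapper whose __iter__ opens the file afresh on every pass, so
repeating the loop never touches a half-consumed file object (the
lower-case class name just mirrors the usage above):

    class readlines:
        def __init__(self, path):
            self.path = path
        def __iter__(self):             # every loop gets a brand-new pass
            with open(self.path) as f:
                for line in f:
                    yield line

    # for line in readlines("foo.txt"): print(line)   -- repeatable at will
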
> 
> 
> My Not-So-Ideal Protocol
> ------------------------
> 
> All right.  So new syntax may be hard to swallow.  An alternative
> is to introduce an adapter that turns an iterator into something
> that "for" will accept -- that is, the opposite of iter().
> 
>     - The built-in seq(it) returns x such that iter(x) yields it.
> 
> Then instead of writing
> 
>     for x from it:
> 
> you would write
> 
>     for x in seq(it):
> 
> and the rest would be the same.  The use of "seq" here is what
> would flag the fact that "it" will be destroyed.
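
A hypothetical sketch of that adapter: seq(it) is just an object whose
__iter__ hands back it unchanged, so the visible seq() call is what
signals that the iterator is about to be consumed.

    class seq:
        def __init__(self, it):
            self._it = it
        def __iter__(self):
            return self._it             # iter(seq(it)) yields it itself

    # for x in seq(it): ...             # reads as: "it" will be drained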

I don't feel I have to drive it home any further, so I'll leave these
last few paragraphs without comments.

--Guido van Rossum (home page: http://www.python.org/~guido/)