[Python-Dev] The iterator story

Ka-Ping Yee ping@zesty.ca
Fri, 19 Jul 2002 04:28:32 -0700 (PDT)


Here is a summary of the whole iterator picture as i currently see it.
This is necessarily subjective, but i will try to be precise so that
it's clear where i'm making a value judgement and where i'm trying to
state fact, and so we can pinpoint areas where we agree and disagree.

In the subjective sections, i have marked with [@] the places where
i solicit agreement or disagreement.

I would like to know your opinions on the issues listed below,
and on the places marked [@].


Definitions (objective)
-----------------------

Container: a thing that provides non-destructive access to a varying
number of other things.

    Why "non-destructive"?  Because i don't expect that merely looking
    at the contents will cause a container to be altered.  For example,
    i expect to be able to look inside a container, see that there are
    five elements; leave it alone for a while, come back to it later
    and observe once again that there are five elements.

    Consequently, a file object is not a container in general.  Given
    a file object, you cannot look at it to see if it contains an "A",
    and then later look at it once again to see if it contains an "A"
    and get the same result.  If you could seek, then you could do
    this, but not all files support seeking.  Even if you could seek,
    the act of reading the file would still alter the file object.

    The file object provides no way of getting at the contents without
    mutating itself.  According to my definition, it's fine for a
    container to have ways of mutating itself; but there has to be
    *some* way of getting the contents without mutating the container,
    or it just ain't a container to me.

    A file object is better described as a stream.  Hypothetically
    one could create an interface to seekable files that offered some
    non-mutating read operations; this would cause the file to look
    more like an array of bytes, and i would find it appropriate to
    call that interface a container.
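
    As a sketch of what i mean (hypothetical names, and glossing over
    details such as negative offsets), such an interface might read a
    byte by saving and restoring the file position:

        class FileBytes:
            def __init__(self, file):
                self.file = file
            def __getitem__(self, i):
                # Remember where the file was, read one byte at offset i,
                # then put the position back so nothing visibly changes.
                here = self.file.tell()
                try:
                    self.file.seek(i)
                    byte = self.file.read(1)
                finally:
                    self.file.seek(here)
                if not byte:
                    raise IndexError(i)
                return byte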

Iterator: a thing that you can poke (i.e. send a no-argument message),
where each time you poke it, it either yields something or announces
that it is exhausted.

    For an iterator to mutate itself every time you poke it is not
    part of my definition.  But the only non-mutating iterator would
    be an iterator that returns the same thing forever, or an iterator
    that is always exhausted.  So in practice, nearly all iterators mutate.

    Some iterators are associated with a container, but not all.

    There can be many kinds of iterators associated with a container.
    The most natural kind is one that yields the elements of the
    container, one by one, mutating itself each time it is poked,
    until it has yielded all of the elements of the container and
    announces exhaustion.

A Container's Natural Iterator: an iterator that yields the elements
of the container, one by one, in the order that makes the most sense
for the container.  If the container has a finite size n, then the
iterator can be poked exactly n times, and thereafter it is exhausted.
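
    A minimal sketch of such an iterator, spelled with the current
    next() method and hypothetical names:

        class NaturalIterator:
            # Assumes an integer-indexable container with a known length.
            def __init__(self, container):
                self.container = container
                self.index = 0
            def next(self):                          # one "poke"
                # Each poke mutates the iterator, not the container.
                if self.index >= len(self.container):
                    raise StopIteration              # exhausted
                item = self.container[self.index]
                self.index = self.index + 1
                return item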


Issues (objective)
------------------

I alluded to a set of issues in an earlier message, and i'll begin
there, by defining what i meant more precisely.

The Destructive-For Issue:

    In most languages i can think of, and in Python for the most
    part, a statement such as "for x in y: print x" is a
    non-destructive operation on y.  Repeating "for x in y: print x"
    will produce exactly the same results once more.

    For pre-iterator versions of Python, this fails to be true only
    if y's __getitem__ method mutates y.  The introduction of
    iterators means this is no longer true when y is any iterator.

    The issue is, should "for" be non-destructive?

The Destructive-In Issue:

    Notice that the iteration that takes place for the "in" operator
    is implemented in the same way as "for".  So if "for" destroys
    its second operand, so will "in".

    The issue is, should "in" be non-destructive?

    (Similar issues exist for built-ins that iterate, like list().)
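
    For instance, under the current protocol, list() quietly consumes
    an iterator passed to it:

        it = iter([1, 2, 3])
        print list(it)       # [1, 2, 3]
        print list(it)       # [] -- the first call used up the iterator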

The __iter__-On-Iterators Issue:

    Some people have mentioned that the presence of an __iter__()
    method is a way of signifying that an object supports the
    iterator protocol.  It has been said that this is necessary
    because the presence of a "next()" method is not sufficiently
    distinguishing.

    Some have said that __iter__() is a completely distinct protocol
    from the iterator protocol.

    The issue is, what is __iter__() really for?

    And secondarily, if it is not part of the iterator protocol,
    then should we require __iter__() on iterators, and why?

The __next__-Naming Issue:

    The iteration method is currently called "next()".

    Previous candidates for the name of this method were "next",
    "__next__", and "__call__".  After some previous debate,
    it was pronounced to be "next()".

    There are concerns that "next()" might collide with existing
    methods named "next()".  There is also a concern that "next()"
    is inconsistent because it is the only type-slot-method that
    does not have a __special__ name.

    The issue is, should it be called "next" or "__next__"?
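
    To illustrate the collision concern, here is a hypothetical class
    whose "next" has nothing to do with iteration:

        class Tokenizer:
            def __init__(self, text):
                self.words = text.split()
            def next(self):
                # Returns the next token, or None at end of input --
                # no StopIteration, and not an iterator at all.
                if self.words:
                    return self.words.pop(0)
                return None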


My Positions (subjective)
-------------------------

I believe that "for" and "in" and list() should be non-destructive.
I believe that __iter__() should not be required on iterators.
I believe that __next__() is a better name than next().

Destructive-For, Destructive-In:

    I think "for" should be non-destructive because that's the way
    it has almost always behaved, and that's the way it behaves in
    any other language [@] i can think of.

    For a container's __getitem__ method to mutate the container is,
    in my opinion, bad behaviour.  In pre-iterator Python, we needed
    some way to allow the convenience of "for" on user-implemented
    containers.  So "for" supported a special protocol where it would
    call __getitem__ with increasing integers starting from 0 until
    it hit an IndexError.  This protocol works great for sequence-like
    containers that are indexable by integers.
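
    A sketch of that old protocol, with a hypothetical Squares
    container:

        class Squares:
            def __init__(self, n):
                self.n = n
            def __getitem__(self, i):
                if i >= self.n:
                    raise IndexError   # tells "for" to stop
                return i * i

        for x in Squares(5):
            print x                    # 0 1 4 9 16; Squares(5) is unchanged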

    But other containers had to be hacked somewhat to make them fit.
    For example, there was no good way to do "for" over a dictionary-like
    container.  If you attempted "for" over a user-implemented dictionary,
    you got a really weird "KeyError: 0", which only made sense if you
    understood that the "for" loop was attempting __getitem__(0).

    (Hey!  I just noticed that

        from UserDict import UserDict
        for k in UserDict(): print k

    still produces "KeyError: 0"!  This oughta be fixed...)

    If you wanted to support "for" on something else, sometimes you
    would have to make __getitem__ mutate the object, like it does
    in the fileinput module.  But then the user has to know that
    this object is a special case: "for" only works the first time.
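
    Roughly like this (a hypothetical, much-simplified cousin of what
    fileinput does):

        class LineReader:
            def __init__(self, file):
                self.file = file
            def __getitem__(self, i):
                # Ignores the index and just reads another line,
                # mutating the underlying file as a side effect.
                line = self.file.readline()
                if not line:
                    raise IndexError   # end of file ends the "for" loop
                return line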

    When iterators were introduced, i believed they were supposed
    to solve this problem.  Currently, they don't.

    Currently, "in" can even be destructive.  This is more serious.
    While one could argue that it's not so strange for

        for x in y: ...

    to alter y (even though i do think it is strange), i believe
    just about anyone would find it very counterintuitive for

        if x in y:

    to alter y.  [@]
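
    For example, under the current protocol:

        it = iter(["a", "b", "c", "d"])
        print "b" in it    # true -- but "a" and "b" are now consumed
        print "b" in it    # false -- the same test, a different answer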

__iter__-On-Iterators:

    I believe __iter__ is not a type flag.  As i argued previously,
    i think that looking for the presence of methods that don't actually
    implement a protocol is a poor way to check for protocol support.
    And as things stand, the presence of __iter__ doesn't even work [@]
    as a type flag.

    There are objects with __iter__ that are not iterators (like most
    containers).  And there are objects without __iter__ that work as
    iterators.  I know you can legislate the latter away, but i think
    such legislation would amount to fighting the programmers -- and
    it is infeasible [@] to enforce the presence of __iter__ in practice.

    Based on Guido's positive response, in which he asked me to make
    an addition to the PEP, i believe Guido agrees with me that
    __iter__ is distinct from the protocol of an iterator.  This
    surprised me because it runs counter to the philosophy previously
    expressed in the PEP.

    Now suppose we agree that __iter__ and next are distinct protocols.
    Then why require iterators to support both?  The only reason we
    would want __iter__ on iterators is so that we can use "for" [@]
    with an iterator as the second operand.
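
    To make that concrete: a hypothetical iterator without __iter__
    is perfectly usable by hand, but today's "for" refuses it.

        class Countdown:
            def __init__(self, n):
                self.n = n
            def next(self):
                if self.n <= 0:
                    raise StopIteration
                self.n = self.n - 1
                return self.n

        it = Countdown(3)
        print it.next(), it.next(), it.next()   # 2 1 0 -- a fine iterator

        for x in Countdown(3):                  # TypeError: "for" insists on
            print x                             # __iter__ (or __getitem__)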

    I have just argued, above, that it's *not* a good idea for "for"
    and "in" to be destructive.  Since most iterators self-mutate,
    it follows that it's not advisable to use an iterator directly
    as the second operand of a "for" or "in".

    I realize this seems radical!  This may be the most controversial
    point i have made.  But if you accept that "in" should not
    destroy its second argument, the conclusion is unavoidable.

__next__-Naming:

    I think the potential for collision, though small, is significant,
    and this makes "__next__" a better choice than "next".  A built-in
    function next() should be introduced; this function would call the
    tp_iternext slot, and for instance objects tp_iternext would call
    the __next__ method implemented in Python.
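
    In rough Python terms, the proposed built-in would amount to:

        def next(it):
            # Call the tp_iternext slot; for instance objects, that
            # means calling the __next__ method written in Python.
            return it.__next__()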

    The connection between this issue and the __iter__ issue is that,
    if next() were renamed to __next__(), the argument that __iter__
    is needed as a flag would also go away.


The Current PEP (objective)
---------------------------

The current PEP takes the position that "for" and "in" can be
destructive; that __iter__() and next() represent two distinct
protocols, yet iterators are required to support both; and that
the method on iterators is called "next()".


My Ideal Protocol (subjective)
------------------------------

So by now the biggest question/objection you probably have is
"if i can't use an iterator with 'for', then how can i use it?"

The answer is that "for" is a great way to iterate over things;
it's just that it iterates over containers and i want to preserve
that.  We need a different way to iterate over iterators.

In my ideal world, we would allow a new form of "for", such as

    for line from file:
        print line

The use of "from" instead of "in" would imply that we were
(destructively) pulling things out of the iterator, and would
remove any possible parallel to the test "x in y", which should
rightly remain non-destructive.

Here's the whole deal:

    - Iterators provide just one method, __next__().

    - The built-in next() calls tp_iternext.  For instances,
      tp_iternext calls __next__.

    - Objects wanting to be iterated over provide just one method,
      __iter__().  Some of these are containers, but not all.

    - The built-in iter(foo) calls tp_iter.  For instances,
      tp_iter calls __iter__.

    - "for x in y" gets iter(y) and uses it as an iterator.

    - "for x from y" just uses y as the iterator.

That's it.
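
Roughly, the two forms would expand like this (hypothetical code, since
"for ... from ..." is of course not real syntax):

    # "for x in y: <body>" would mean approximately:
    it = iter(y)              # y supplies __iter__ -- typically a container
    while 1:
        try:
            x = next(it)      # the built-in next() calls __next__
        except StopIteration:
            break
        # <body>

    # "for x from y: <body>" would be the same loop without the iter()
    # call: y itself is the iterator, and the loop visibly consumes it.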

Benefits:

    - We have a nice clean division between containers and iterators.

    - When you see "for x in y" you know that y is a container.

    - When you see "for x from y" you know that y is an iterator.

    - "for x in y" never destroys y.

    - "if x in y" never destroys y.

    - If you have an object that is container-like, you can add
      an __iter__ method that gives its natural iterator.  If
      you want, you can supply more iterators that do different
      things; no problem.  No one using your object is confused
      about whether it mutates.

    - If you have an object that is cursor-like or stream-like,
      you can safely make it into an iterator by adding __next__.
      No one using your object is confused about whether it mutates.
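
    For example, a container-like Bag and a cursor-like Cursor (both
    hypothetical, and written against the proposed protocol rather than
    the current one) might look like:

        class Bag:
            def __init__(self, items):
                self.items = items
            def __iter__(self):
                # Natural iterator; iterating never mutates the Bag.
                return iter(self.items)

        class Cursor:
            def __init__(self, rows):
                self.rows = list(rows)
            def __next__(self):
                # Growing a __next__ method is all it takes to be an
                # iterator; "for row from cursor" makes the consumption
                # visible to the reader.
                if not self.rows:
                    raise StopIteration
                return self.rows.pop(0)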

Other notes:

    - Iterator algebra still works fine, and is still easy to write:

        def alternate(it):
            # Yields every other element of the iterator, using the
            # proposed built-in next(); exhaustion simply ends the loop.
            while 1:
                yield next(it)    # pass one element through
                next(it)          # discard the following element

    - The file problem has a consistent solution.  Instead of writing
      "for line in file" you write

        for line from file:
            print line

      Being forced to write "from" signals to you that the file is
      eaten up.  There is no expectation that "for line from file"
      will work again.

      The best would be a convenience function "readlines", to
      make this even clearer:

        for line in readlines("foo.txt"):
            print line

      Now you can do this as many times as you want, and there is
      no possibility of confusion; there is no file object on which
      to call methods that might mess up the reading of lines.
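
      One possible shape for such a function (hypothetical; it opens
      the file itself, so every call hands back a fresh iterator):

        def readlines(filename):
            file = open(filename)
            while 1:
                line = file.readline()
                if not line:
                    break
                yield line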


My Not-So-Ideal Protocol
------------------------

All right.  So new syntax may be hard to swallow.  An alternative
is to introduce an adapter that turns an iterator into something
that "for" will accept -- that is, the opposite of iter().

    - The built-in seq(it) returns an object x such that iter(x) returns
      the original iterator, it.

Then instead of writing

    for x from it:

you would write

    for x in seq(it):

and the rest would be the same.  The use of "seq" here is what
would flag the fact that "it" will be destroyed.
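
A minimal sketch of such an adapter (hypothetical spelling):

    class seq:
        def __init__(self, it):
            self.it = it
        def __iter__(self):
            # Hand back the wrapped iterator unchanged, so that
            # iter(seq(it)) is just it.
            return self.it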


-- ?!ng