[Python-Dev] iterators

Fri, 18 Aug 2000 16:57:15 -0400

Paul Prescod wrote:

> I don't think of iterators as indexing in terms of numbers. Otherwise I
> could do this:
>
> >>> a={0:"zero",1:"one",2:"two",3:"three"}
> >>> for i in a:
> ...     print i
> ...
>
> So from a Python user's point of view, for-looping has nothing to do
> with integers. From a Python class/module creator's point of view it
> does have to do with integers. I wouldn't be either surprised nor
> disappointed if that changed one day.

Bingo!

I've long had an idea for generalizing 'for' loops using iterators. This is
more a Python 3000 thing, but I'll explain it here anyway because I think
it's relevant. Perhaps this should become a PEP?  (Maybe we should have a
series of PEPs with numbers in the 3000 range for Py3k ideas?)

The statement

  for <variable> in <object>: <block>

should translate into this kind of pseudo-code:

  # variant 1
  __temp = <object>.newiterator()
  while 1:
      try: <variable> = __temp.next()
      except ExhaustedIterator: break
      <block>

or perhaps (to avoid the relatively expensive exception handling):

  # variant 2
  __temp = <object>.newiterator()
  while 1:
      __flag, <variable = __temp.next()
      if not __flag: break
      <block>

In variant 1, the next() method returns the next object or raises
ExhaustedIterator. In variant 2, the next() method returns a tuple (<flag>,
<variable>) where <flag> is 1 to indicate that <value> is valid or 0 to
indicate that there are no more items available. I'm not crazy about the
exception, but I'm even less crazy about the more complex next() return
value (careful observers may have noticed that I'm rarely crazy about flag
variables :-).

Another argument for variant 1 is that variant 2 changes what <variable> is
after the loop is exhausted, compared to current practice: currently, it
keeps the last valid value assigned to it. Most likely, the next() method
returns None when the sequence is exhausted. It doesn't make a lot of sense
to require it to return the last item of the sequence -- there may not *be*
a last item, if the sequence is empty, and not all sequences make it
convenient to keep hanging on to the last item in the iterator, so it's best
to specify that next() returns (0, None) when the sequence is exhausted.

(It would be tempting to suggeste a variant 1a where instead of raising an
exception, next() returns None when the sequence is exhausted, but this
won't fly: you couldn't iterate over a list containing some items that are
None.)

Side note: I believe that the iterator approach could actually *speed up*
iteration over lists compared to today's code. This is because currently the
interation index is a Python integer object that is stored on the stack.
This means an integer add with overflow check, allocation, and deallocation
on each iteration! But the iterator for lists (and other basic sequences)
could easily store the index as a C int! (As long as the sequence's length
is stored in an int, the index can't overflow.)

[Warning: thinking aloud ahead!]

Once we have the concept of iterators, we could support explicit use of them
as well. E.g. we could use a variant of the for statement to iterate over an
existing iterator:

  for <variable> over <iterator>: <block>

which would (assuming variant 1 above) translate to:

  while 1:
      try: <variable> = <iterator>.next()
      except ExhaustedIterator: break
      <block>

This could be used in situations where you have a loop iterating over the
first half of a sequence and a second loop that iterates over the remaining
items:

  it = something.newiterator()
  for x over it:
      if time_to_start_second_loop(): break
      do_something()
  for x over it:
      do_something_else()

Note that the second for loop doesn't reset the iterator -- it just picks up
where the first one left off! (Detail: the x that caused the break in the
first loop doesn't get dealt with in the second loop.)

I like the iterator concept because it allows us to do things lazily. There
are lots of other possibilities for iterators. E.g. mappings could have
several iterator variants to loop over keys, values, or both, in sorted or
hash-table order. Sequences could have an iterator for traversing them
backwards, and a few other ones for iterating over their index set (cf.
indices()) and over (index, value) tuples (cf. irange()). Files could be
their own iterator where the iterator is almost the same as readline()
except it raises ExhaustedIterator at EOF instead of returning "".  A tree
datastructure class could have an associated iterator class that maintains a
"finger" into the tree.

Hm, perhaps iterators could be their own iterator? Then if 'it' were an
iterator, it.newiterator() would return a reference to itself (not a copy).
Then we wouldn't even need the 'over' alternative syntax. Maybe the method
should be called iterator() then, not newiterator(), to avoid suggesting
anything about the newness of the returned iterator.

Other ideas:

- Iterators could have a backup() method that moves the index back (or
raises an exception if this feature is not supported, e.g. when reading data
from a pipe).

- Iterators over certain sequences might support operations on the
underlying sequence at the current position of the iterator, so that you
could iterate over a sequence and occasionally insert or delete an item (or
a slice).

Of course iterators also connect to generators. The basic list iterator
doesn't need coroutines or anything, it can be done as follows:

  class Iterator:
      def __init__(self, seq):
          self.seq = seq
          self.ind = 0
      def next(self):
          if self.ind >= len(self.seq): raise ExhaustedIterator
          val = self.seq[self.ind]
          self.ind += 1
          return val

so that <list>.iterator() could just return Iterator(<list>) -- at least
conceptually.

But for other data structures the amount of state needed might be
cumbersome. E.g. a tree iterator needs to maintain a stack, and it's much
easier to code that using a recursive Icon-style generator than by using an
explicit stack. On the other hand, I remember reading an article a while ago
(in Dr. Dobbs?) by someone who argued (in a C++ context) that such recursive
solutions are very inefficient, and that an explicit stack (1) is really not
that hard to code, and (2) gives much more control over the memory and time
consumption of the code. On the third hand, some forms of iteration really
*are* expressed much more clearly using recursion. On the fourth hand, I
disagree with Matthias ("Dr. Scheme") Felleisen about recursion as the root
of all iteration. Anyway, I believe that iterators (as explained above) can
be useful even if we don't have generators (in the Icon sense, which I
believe means coroutine-style).

--Guido