[Python-Dev] PEP 255: Simple Generators, Revised Posting

Tim Peters tim.one@home.com
Sat, 23 Jun 2001 05:17:54 -0400

Major revision:  more details about exceptions, return vs StopIteration, and
interactions with try/except/finally; more Q&A; and a BDFL Pronouncement.  The
reference implementation appears solid and works as described here in all
respects, so I expect this will be the last major revision (and so also last
full posting) of this PEP.

The output below is in ndiff format (see Tools/scripts/ndiff.py in your Python
distribution).  Just the new text can be seen in HTML form here:


"Feature discussions" should take place primarily on the Python Iterators list:


Implementation discussions may wander in and out of Python-Dev too.

  PEP: 255
  Title: Simple Generators
- Version: $Revision: 1.3 $
?                       ^
+ Version: $Revision: 1.12 $
?                       ^^
  Author: nas@python.ca (Neil Schemenauer),
      tim.one@home.com (Tim Peters),
      magnus@hetland.org (Magnus Lie Hetland)
  Discussion-To: python-iterators@lists.sourceforge.net
  Status: Draft
  Type: Standards Track
  Requires: 234
  Created: 18-May-2001
  Python-Version: 2.2
- Post-History: 14-Jun-2001
+ Post-History: 14-Jun-2001, 23-Jun-2001
?                          +++++++++++++


      This PEP introduces the concept of generators to Python, as well
      as a new statement used in conjunction with them, the "yield"


      When a producer function has a hard enough job that it requires
      maintaining state between values produced, most programming languages
      offer no pleasant and efficient solution beyond adding a callback
      function to the producer's argument list, to be called with each value

      For example, tokenize.py in the standard library takes this approach:
      the caller must pass a "tokeneater" function to tokenize(), called
      whenever tokenize() finds the next token.  This allows tokenize to be
      coded in a natural way, but programs calling tokenize are typically
      convoluted by the need to remember between callbacks which token(s)
      were seen last.  The tokeneater function in tabnanny.py is a good
      example of that, maintaining a state machine in global variables, to
      remember across callbacks what it has already seen and what it hopes to
      see next.  This was difficult to get working correctly, and is still
      difficult for people to understand.  Unfortunately, that's typical of
      this approach.

      An alternative would have been for tokenize to produce an entire parse
      of the Python program at once, in a large list.  Then tokenize clients
      could be written in a natural way, using local variables and local
      control flow (such as loops and nested if statements) to keep track of
      their state.  But this isn't practical:  programs can be very large, so
      no a priori bound can be placed on the memory needed to materialize the
      whole parse; and some tokenize clients only want to see whether
      something specific appears early in the program (e.g., a future
      statement, or, as is done in IDLE, just the first indented statement),
      and then parsing the whole program first is a severe waste of time.

      Another alternative would be to make tokenize an iterator[1],
      delivering the next token whenever its .next() method is invoked.  This
      is pleasant for the caller in the same way a large list of results
      would be, but without the memory and "what if I want to get out early?"
      drawbacks.  However, this shifts the burden on tokenize to remember
      *its* state between .next() invocations, and the reader need only
      glance at tokenize.tokenize_loop() to realize what a horrid chore that
      would be.  Or picture a recursive algorithm for producing the nodes of
      a general tree structure:  to cast that into an iterator framework
      requires removing the recursion manually and maintaining the state of
      the traversal by hand.

      A fourth option is to run the producer and consumer in separate
      threads.  This allows both to maintain their states in natural ways,
      and so is pleasant for both.  Indeed, Demo/threads/Generator.py in the
      Python source distribution provides a usable synchronized-communication
      class for doing that in a general way.  This doesn't work on platforms
      without threads, though, and is very slow on platforms that do
      (compared to what is achievable without threads).

      A final option is to use the Stackless[2][3] variant implementation of
      Python instead, which supports lightweight coroutines.  This has much
      the same programmatic benefits as the thread option, but is much more
      efficient.  However, Stackless is a controversial rethinking of the
      Python core, and it may not be possible for Jython to implement the
      same semantics.  This PEP isn't the place to debate that, so suffice it
      to say here that generators provide a useful subset of Stackless
      functionality in a way that fits easily into the current CPython
      implementation, and is believed to be relatively straightforward for
      other Python implementations.

      That exhausts the current alternatives.  Some other high-level
      languages provide pleasant solutions, notably iterators in Sather[4],
      which were inspired by iterators in CLU; and generators in Icon[5], a
      novel language where every expression "is a generator".  There are
      differences among these, but the basic idea is the same:  provide a
      kind of function that can return an intermediate result ("the next
      value") to its caller, but maintaining the function's local state so
      that the function can be resumed again right where it left off.  A
      very simple example:

          def fib():
              a, b = 0, 1
              while 1:
                  yield b
                  a, b = b, a+b

      When fib() is first invoked, it sets a to 0 and b to 1, then yields b
      back to its caller.  The caller sees 1.  When fib is resumed, from its
      point of view the yield statement is really the same as, say, a print
      statement:  fib continues after the yield with all local state intact.
      a and b then become 1 and 1, and fib loops back to the yield, yielding
      1 to its invoker.  And so on.  From fib's point of view it's just
      delivering a sequence of results, as if via callback.  But from its
      caller's point of view, the fib invocation is an iterable object that
      can be resumed at will.  As in the thread approach, this allows both
      sides to be coded in the most natural ways; but unlike the thread
      approach, this can be done efficiently and on all platforms.  Indeed,
      resuming a generator should be no more expensive than a function call.

      The same kind of approach applies to many producer/consumer functions.
      For example, tokenize.py could yield the next token instead of invoking
      a callback function with it as argument, and tokenize clients could
      iterate over the tokens in a natural way:  a Python generator is a kind
      of Python iterator[1], but of an especially powerful kind.

- Specification
+ Specification:  Yield
?              ++++++++

      A new statement is introduced:

          yield_stmt:    "yield" expression_list

      "yield" is a new keyword, so a future statement[8] is needed to phase
-     this in.  [XXX spell this out]
+     this in.  [XXX spell this out -- but new keywords have ripple effects
+     across tools too, and it's not clear this can be forced into the future
+     framework at all -- it's not even clear that Python's parser alone can
+     be taught to swing both ways based on a future stmt]

      The yield statement may only be used inside functions.  A function that
-     contains a yield statement is called a generator function.
+     contains a yield statement is called a generator function.  A generator
?                                                               +++++++++++++
+     function is an ordinary function object in all respects, but has the
+     new CO_GENERATOR flag set in the code object's co_flags member.

      When a generator function is called, the actual arguments are bound to
      function-local formal argument names in the usual way, but no code in
      the body of the function is executed.  Instead a generator-iterator
      object is returned; this conforms to the iterator protocol[6], so in
      particular can be used in for-loops in a natural way.  Note that when
      the intent is clear from context, the unqualified name "generator" may
      be used to refer either to a generator-function or a generator-

      Each time the .next() method of a generator-iterator is invoked, the
      code in the body of the generator-function is executed until a yield
      or return statement (see below) is encountered, or until the end of
      the body is reached.

      If a yield statement is encountered, the state of the function is
      frozen, and the value of expression_list is returned to .next()'s
      caller.  By "frozen" we mean that all local state is retained,
      including the current bindings of local variables, the instruction
      pointer, and the internal evaluation stack:  enough information is
      saved so that the next time .next() is invoked, the function can
      proceed exactly as if the yield statement were just another external

+     Restriction:  A yield statement is not allowed in the try clause of a
+     try/finally construct.  The difficulty is that there's no guarantee
+     the generator will ever be resumed, hence no guarantee that the finally
+     block will ever get executed; that's too much a violation of finally's
+     purpose to bear.
+ Specification:  Return
      A generator function can also contain return statements of the form:


      Note that an expression_list is not allowed on return statements
      in the body of a generator (although, of course, they may appear in
      the bodies of non-generator functions nested within the generator).

-     When a return statement is encountered, nothing is returned, but a
+     When a return statement is encountered, control proceeds as in any
+     function return, executing the appropriate finally clauses (if any
-     StopIteration exception is raised, signalling that the iterator is
?                                                           ------------
+     exist).  Then a StopIteration exception is raised, signalling that the
?    ++++++++++++++++
-     exhausted.   The same is true if control flows off the end of the
+     iterator is exhausted.   A StopIteration exception is also raised if
+     control flows off the end of the generator without an explict return.
-     function.  Note that return means "I'm done, and have nothing
?   -----------
+     Note that return means "I'm done, and have nothing interesting to
?                                                       +++++++++++++++
-     interesting to return", for both generator functions and non-generator
?    ---------------
+     return", for both generator functions and non-generator functions.
?                                                            +++++++++++
-     functions.
+     Note that return isn't always equivalent to raising StopIteration:  the
+     difference lies in how enclosing try/except constructs are treated.
+     For example,
+         >>> def f1():
+         ...     try:
+         ...         return
+         ...     except:
+         ...        yield 1
+         >>> print list(f1())
+         []
+     because, as in any function, return simply exits, but
+         >>> def f2():
+         ...     try:
+         ...         raise StopIteration
+         ...     except:
+         ...         yield 42
+         >>> print list(f2())
+         [42]
+     because StopIteration is captured by a bare "except", as is any
+     exception.
+ Specification:  Generators and Exception Propagation
+     If an unhandled exception-- including, but not limited to,
+     StopIteration --is raised by, or passes through, a generator function,
+     then the exception is passed on to the caller in the usual way, and
+     subsequent attempts to resume the generator function raise
+     StopIteration.  In other words, an unhandled exception terminates a
+     generator's useful life.
+     Example (not idiomatic but to illustrate the point):
+     >>> def f():
+     ...     return 1/0
+     >>> def g():
+     ...     yield f()  # the zero division exception propagates
+     ...     yield 42   # and we'll never get here
+     >>> k = g()
+     >>> k.next()
+     Traceback (most recent call last):
+       File "<stdin>", line 1, in ?
+       File "<stdin>", line 2, in g
+       File "<stdin>", line 2, in f
+     ZeroDivisionError: integer division or modulo by zero
+     >>> k.next()  # and the generator cannot be resumed
+     Traceback (most recent call last):
+       File "<stdin>", line 1, in ?
+     StopIteration
+     >>>
+ Specification:  Try/Except/Finally
+     As noted earlier, yield is not allowed in the try clause of a try/
+     finally construct.  A consequence is that generators should allocate
+     critical resources with great care.  There is no restriction on yield
+     otherwise appearing in finally clauses, except clauses, or in the try
+     clause of a try/except construct:
+     >>> def f():
+     ...     try:
+     ...         yield 1
+     ...         try:
+     ...             yield 2
+     ...             1/0
+     ...             yield 3  # never get here
+     ...         except ZeroDivisionError:
+     ...             yield 4
+     ...             yield 5
+     ...             raise
+     ...         except:
+     ...             yield 6
+     ...         yield 7     # the "raise" above stops this
+     ...     except:
+     ...         yield 8
+     ...     yield 9
+     ...     try:
+     ...         x = 12
+     ...     finally:
+     ...         yield 10
+     ...     yield 11
+     >>> print list(f())
+     [1, 2, 4, 5, 8, 9, 10, 11]
+     >>>


          # A binary tree class.
          class Tree:

              def __init__(self, label, left=None, right=None):
                  self.label = label
                  self.left = left
                  self.right = right

              def __repr__(self, level=0, indent="    "):
                  s = level*indent + `self.label`
                  if self.left:
                      s = s + "\n" + self.left.__repr__(level+1, indent)
                  if self.right:
                      s = s + "\n" + self.right.__repr__(level+1, indent)
                  return s

              def __iter__(self):
                  return inorder(self)

          # Create a Tree from a list.
          def tree(list):
              n = len(list)
              if n == 0:
                  return []
              i = n / 2
              return Tree(list[i], tree(list[:i]), tree(list[i+1:]))

          # A recursive generator that generates Tree leaves in in-order.
          def inorder(t):
              if t:
                  for x in inorder(t.left):
                      yield x
                  yield t.label
                  for x in inorder(t.right):
                      yield x

          # Show it off: create a tree.
          # Print the nodes of the tree in in-order.
          for x in t:
              print x,

          # A non-recursive generator.
          def inorder(node):
              stack = []
              while node:
                  while node.left:
                      node = node.left
                  yield node.label
                  while not node.right:
                          node = stack.pop()
                      except IndexError:
                      yield node.label
                  node = node.right

          # Exercise the non-recursive generator.
          for x in t:
              print x,

+     Both output blocks display:
+         A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

  Q & A

+     Q. Why not a new keyword instead of reusing "def"?
+     A. See BDFL Pronouncements section below.
-     Q. Why a new keyword?  Why not a builtin function instead?
+     Q. Why a new keyword for "yield"?  Why not a builtin function instead?
?                         ++++++++++++

      A. Control flow is much better expressed via keyword in Python, and
         yield is a control construct.  It's also believed that efficient
         implementation in Jython requires that the compiler be able to
         determine potential suspension points at compile-time, and a new
-        keyword makes that easy.
+        keyword makes that easy.  The CPython referrence implementation also
+        exploits it heavily, to detect which functions *are* generator-
+        functions (although a new keyword in place of "def" would solve that
+        for CPython -- but people asking the "why a new keyword?" question
+        don't want any new keyword).
+     Q: Then why not some other special syntax without a new keyword?  For
+        example, one of these instead of "yield 3":
+        return 3 and continue
+        return and continue 3
+        return generating 3
+        continue return 3
+        return >> , 3
+        from generator return 3
+        return >> 3
+        return << 3
+        >> 3
+        << 3
+     A: Did I miss one <wink>?  Out of hundreds of messages, I counted two
+        suggesting such an alternative, and extracted the above from them.
+        It would be nice not to need a new keyword, but nicer to make yield
+        very clear -- I don't want to have to *deduce* that a yield is
+        occurring from making sense of a previously senseless sequence of
+        keywords or operators.  Still, if this attracts enough interest,
+        proponents should settle on a single consensus suggestion, and Guido
+        will Pronounce on it.
+     Q. Why allow "return" at all?  Why not force termination to be spelled
+        "raise StopIteration"?
+     A. The mechanics of StopIteration are low-level details, much like the
+        mechanics of IndexError in Python 2.1:  the implementation needs to
+        do *something* well-defined under the covers, and Python exposes
+        these mechanisms for advanced users.  That's not an argument for
+        forcing everyone to work at that level, though.  "return" means "I'm
+        done" in any kind of function, and that's easy to explain and to use.
+        Note that "return" isn't always equivalent to "raise StopIteration"
+        in try/except construct, either (see the "Specification: Return"
+        section).
+     Q. Then why not allow an expression on "return" too?
+     A. Perhaps we will someday.  In Icon, "return expr" means both "I'm
+        done", and "but I have one final useful value to return too, and
+        this is it".  At the start, and in the absence of compelling uses
+        for "return expr", it's simply cleaner to use "yield" exclusively
+        for delivering values.
+ BDFL Pronouncements
+     Issue:  Introduce another new keyword (say, "gen" or "generator") in
+     place of "def", or otherwise alter the syntax, to distinguish
+     generator-functions from non-generator functions.
+     Con:  In practice (how you think about them), generators *are*
+     functions, but with the twist that they're resumable.  The mechanics of
+     how they're set up is a comparatively minor technical issue, and
+     introducing a new keyword would unhelpfully overemphasize the
+     mechanics of how generators get started (a vital but tiny part of a
+     generator's life).
+     Pro:  In reality (how you think about them), generator-functions are
+     actually factory functions that produce generator-iterators as if by
+     magic.  In this respect they're radically different from non-generator
+     functions, acting more like a constructor than a function, so reusing
+     "def" is at best confusing.  A "yield" statement buried in the body is
+     not enough warning that the semantics are so different.
+     BDFL:  "def" it stays.  No argument on either side is totally
+     convincing, so I have consulted my language designer's intuition.  It
+     tells me that the syntax proposed in the PEP is exactly right - not too
+     hot, not too cold.  But, like the Oracle at Delphi in Greek mythology,
+     it doesn't tell me why, so I don't have a rebuttal for the arguments
+     against the PEP syntax.  The best I can come up with (apart from
+     agreeing with the rebuttals ... already made) is "FUD".  If this had
+     been part of the language from day one, I very much doubt it would have
+     made Andrew Kuchling's "Python Warts" page.

  Reference Implementation

-     A preliminary patch against the CVS Python source is available[7].
+     The current implementation, in a preliminary state (no docs and no
+     focused tests), is part of Python's CVS development tree[9].
+     Using this requires that you build Python from source.
+     This was derived from an earlier patch by Neil Schemenauer[7].

  Footnotes and References

      [1] PEP 234, http://python.sf.net/peps/pep-0234.html
      [2] http://www.stackless.com/
      [3] PEP 219, http://python.sf.net/peps/pep-0219.html
      [4] "Iteration Abstraction in Sather"
          Murer , Omohundro, Stoutamire and Szyperski
      [5] http://www.cs.arizona.edu/icon/
      [6] The concept of iterators is described in PEP 234
      [7] http://python.ca/nas/python/generator.diff
      [8] http://python.sf.net/peps/pep-0236.html
+     [9] To experiment with this implementation, check out Python from CVS
+         according to the instructions at
+             http://sf.net/cvs/?group_id=5470


      This document has been placed in the public domain.

  Local Variables:
  mode: indented-text
  indent-tabs-mode: nil