[Python-ideas] Yield-from example: A parser

Greg Ewing greg.ewing at canterbury.ac.nz
Tue Feb 17 09:45:33 CET 2009


Here's an attempt at providing a fleshed-out use case
for some of the more contentious parts of the yield-from
proposal.


Example using send() and return-with-value to implement a parser
================================================================

I'm going to develop two versions of a simple XML parser. The first
version will use current Python facilities to do it in "pull" mode,
where the scanner is a generator and the parser is a set of
recursive functions that pull tokens out of it as needed.

The second version will turn it around so that it operates in
"push" mode, with the scanner being an outer loop, and the parser
being a generator that gets tokens pushed into it using send().
Using yield-from, it will be seen that the parser code is almost
identical to the first version, the main difference being that
some function calls are replaced by yield-from statements.


Version 1 - Pull-mode, recursive functions
------------------------------------------

First, let's write a scanner and set it to work on some
test data.

   import re
   pat = re.compile(r"(\S+)|(<[^>]*>)")

   def scanner(text):
     for m in pat.finditer(text):
       token = m.group(0)
       print "Feeding:", repr(token)
       yield token
     yield None # to signal EOF

   text = "<foo> This is a <b> foo file </b> you know. </foo>"
   token_stream = scanner(text)

I'm using a global variable 'token_stream' to hold the
source of tokens here. It could just as well be passed
around as a parameter to the parsing functions, but doing
it this way will better bring out the symmetry with the
push-mode version later on.

Now let's write a function to parse a sequence of items,
each of which is either a plain word or a compound element
surrounded by opening and closing tags. We will return the
result as a suitably-structured parse tree.

   def parse_items(closing_tag = None):
     elems = []
     while 1:
       token = token_stream.next()
       if not token:
         break # EOF
       if is_opening_tag(token):
         elems.append(parse_elem(token))
       elif token == closing_tag:
         break
       else:
         elems.append(token)
     return elems

This makes use of a small utility function to see if a
token is an opening tag:

   def is_opening_tag(token):
     return token.startswith("<") and not token.startswith("</")

It also mutually recurses with the following function that
parses a compound element. It gets passed the opening tag and
eats up all tokens until the corresponding closing tag. It
returns a parse tree for the compound element.

   def parse_elem(opening_tag):
     name = opening_tag[1:-1]
     closing_tag = "</%s>" % name
     items = parse_items(closing_tag)
     return (name, items)

Now we can call the parser and print the result.

   tree = parse_items()
   print tree

This is all currently-valid Python, and when run it produces
the following output.

   Feeding: '<foo>'
   Feeding: 'This'
   Feeding: 'is'
   Feeding: 'a'
   Feeding: '<b>'
   Feeding: 'foo'
   Feeding: 'file'
   Feeding: '</b>'
   Feeding: 'you'
   Feeding: 'know.'
   Feeding: '</foo>'
   [('foo', ['This', 'is', 'a', ('b', ['foo', 'file']), 'you', 'know.'])]
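(For anyone following along in Python 3, here is the same pull-mode parser with print as a function and next() as a builtin; the token stream is passed as a parameter, which, as noted earlier, works just as well as a global. The "Feeding:" trace is omitted for brevity.)

```python
import re

pat = re.compile(r"(\S+)|(<[^>]*>)")

def scanner(text):
    for m in pat.finditer(text):
        yield m.group(0)
    yield None  # to signal EOF

def is_opening_tag(token):
    return token.startswith("<") and not token.startswith("</")

def parse_items(token_stream, closing_tag=None):
    elems = []
    while True:
        token = next(token_stream)
        if not token:
            break  # EOF
        if is_opening_tag(token):
            elems.append(parse_elem(token_stream, token))
        elif token == closing_tag:
            break
        else:
            elems.append(token)
    return elems

def parse_elem(token_stream, opening_tag):
    name = opening_tag[1:-1]
    closing_tag = "</%s>" % name
    return (name, parse_items(token_stream, closing_tag))

text = "<foo> This is a <b> foo file </b> you know. </foo>"
tree = parse_items(scanner(text))
print(tree)
```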


Version 2 - Push-mode, recursive generators
-------------------------------------------

For this version, we turn things around so that the scanner
is on the outside. The top-level code will now be like this:

   def run():
     parser = parse_items()
     parser.next() # prime the generator so it is waiting at its first yield
     try:
       for m in pat.finditer(text):
         token = m.group(0)
         print "Feeding:", repr(token)
         parser.send(token)
       parser.send(None) # to signal EOF
     except StopIteration, e:
       tree = e.args[0]
       print tree

The parser is going to be a generator that returns the parse
tree when it exits. To the outside, this manifests as a
StopIteration exception with an argument, which we extract.
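This mechanism can be sketched in isolation, using Python 3 syntax (where return-with-value inside a generator did end up raising StopIteration carrying the value, exactly as proposed):

```python
# A generator that "returns" a value. The unreachable bare yield is what
# makes this a generator rather than an ordinary function.
def answer():
    return 42
    yield

gen = answer()
try:
    next(gen)
    result = None
except StopIteration as e:
    # The returned value manifests as the exception's argument.
    result = e.args[0]

print(result)  # -> 42
```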

(We're assuming that there are no syntax errors in the
input, so the parser is ready to exit by the time the input
is exhausted. A more robust implementation would take care
of the case that it isn't.)
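One way that more robust driver could look (run_checked and the toy collect parser below are illustrative names of mine, not part of the proposal): if the input is exhausted but the parser has not finished, report an error instead of falling off the end.

```python
def run_checked(parser, tokens):
    """Drive a generator-based parser; complain if input runs out early."""
    next(parser)  # prime the generator so it can accept send()
    try:
        for token in tokens:
            parser.send(token)
        parser.send(None)  # to signal EOF
    except StopIteration as e:
        return e.args[0]
    raise SyntaxError("input exhausted but the parser still expects tokens")

# A toy parser: collects tokens until EOF, then returns them as a list.
def collect():
    items = []
    while True:
        token = yield
        if token is None:
            return items
        items.append(token)

result = run_checked(collect(), ["a", "b"])
print(result)  # -> ['a', 'b']
```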

Now the only thing we need to do to our two parsing functions
parse_items and parse_elem is to replace any calls to get a
token with 'yield', and insert 'yield from' into each of the
recursive calls. The changes are marked with arrows below.

   def parse_elem(opening_tag):
     name = opening_tag[1:-1]
     closing_tag = "</%s>" % name
     items = yield from parse_items(closing_tag) # <---
     return (name, items)

   def parse_items(closing_tag = None):
     elems = []
     while 1:
       token = yield # <---
       if not token:
         break # EOF
       if is_opening_tag(token):
         elems.append(yield from parse_elem(token)) # <---
       elif token == closing_tag:
         break
       else:
         elems.append(token)
     return elems

That's all -- everything else remains the same.

I can't run this, because yield-from doesn't exist, but if I
could, the output would be the same as before.
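For the record, yield-from was later accepted as PEP 380 and shipped in Python 3.3, so the push-mode parser does now run. Translated into that final syntax, the whole thing looks like this (a sketch; note one detail the code above glosses over: a fresh generator must be primed with next() before it can accept a send()):

```python
import re

pat = re.compile(r"(\S+)|(<[^>]*>)")

def is_opening_tag(token):
    return token.startswith("<") and not token.startswith("</")

def parse_elem(opening_tag):
    name = opening_tag[1:-1]
    closing_tag = "</%s>" % name
    items = yield from parse_items(closing_tag)
    return (name, items)

def parse_items(closing_tag=None):
    elems = []
    while True:
        token = yield
        if not token:
            break  # EOF
        if is_opening_tag(token):
            elems.append((yield from parse_elem(token)))
        elif token == closing_tag:
            break
        else:
            elems.append(token)
    return elems

def run(text):
    parser = parse_items()
    next(parser)  # prime the generator
    try:
        for m in pat.finditer(text):
            parser.send(m.group(0))
        parser.send(None)  # to signal EOF
    except StopIteration as e:
        return e.value  # the parse tree returned by parse_items

text = "<foo> This is a <b> foo file </b> you know. </foo>"
tree = run(text)
print(tree)
```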

I hope this helps to give an idea of what I'm talking about
with generators-as-threads. The whole parser can be seen as
a lightweight thread, implemented as a collection of mutually
recursive functions. Whenever any of the functions needs to
get a token, it yields. To return a value to the calling
function, it uses a return statement just like any other
function.

I hope it is also clear why returning values via yield,
or having 'yield from' return any of the yielded values,
would be the wrong thing to do. The send/yield channel and
the return channel are being used for completely different
purposes, and conflating them would be disastrous.

-- 
Greg


