[Tutor] parsing--is this right?

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Tue, 11 Jun 2002 10:19:00 -0700 (PDT)


On Mon, 10 Jun 2002, Paul Tremblay wrote:


> I see that you pass a list of tokens to the parser method in the class
> chunk.
>
[some text cut and moved around]
>
> (2) I don't understand why you don't have to create an object first to
> use this method.
>
> (3) I don't understand how you can call on the Chunk method, when
> it is the name of a class.


Hi Paul:

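Actually, the parse() function is meant to be standalone: it's not a
method, so no object has to exist before we call it.  I'm only using
Chunk() to group the data together, and to make it easier to extract the
command type later on.  Calling a class by its name, as in Chunk(...), is
simply how Python constructs a new instance of that class.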


> (4) I don't understand how this code would work if the tokens were
> broken over lines. I guess you could read the file in as lines and set
> your example text to lines.

As long as we first do all of the tokenization before parsing, we should
be ok.  In many cases, we want to break down this parsing task into two
tasks:

     1. Breaking our text file into a bunch of recognizable tokens
     2. Figuring out the structure between those tokens.

By breaking it down this way, both tasks can become simpler.


The rtf parser I wrote only recognizes two categories of tokens: group
boundaries (the brackets "{" and "}"), and everything else.  Since it only
cares about those two categories, and the tokenizer has already thrown
away the whitespace, it doesn't have to worry about newlines.  That's why
a lot of parsers are paired together with tokenizers, so that the parser
can avoid thinking about content, and concentrate more on categories.


> (1)I don't understand how this method continues to read each item in the
> list.

This parser progressively eats more and more of the tokens by using the
pop() method of lists.  Calling tokens.pop(0) removes the first element
from our tokens list and returns that element back to us, all in one
step.  For example:

###
>>> tokens = ['hello', 'world', 'this', 'is', 'a', 'test']
>>> tokens.pop(0)
'hello'
>>> tokens.pop(0)
'world'
>>> tokens.pop(0)
'this'
>>> tokens.pop(0)
'is'
>>> tokens.pop(0)
'a'
>>> tokens.pop(0)
'test'
>>> tokens.pop(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: pop from empty list
###
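
The reason this keeps the parser moving forward is that a list passed into
a function isn't copied: every function that pops from tokens is eating
from the same list.  Here's a minimal sketch of that idea.  The eat_one()
helper is made up for illustration and isn't part of the parser:

```python
def eat_one(tokens):
    """Hypothetical helper: remove and return the first token."""
    return tokens.pop(0)


tokens = ['hello', 'world', 'this', 'is', 'a', 'test']

# Each call shortens the caller's list, so the second call
# picks up exactly where the first one stopped.
first = eat_one(tokens)
second = eat_one(tokens)

print(first, second)   # hello world
print(tokens)          # ['this', 'is', 'a', 'test']
```

That shared, shrinking list is what lets one parsing function hand off to
another without any bookkeeping about "where we are" in the input.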



If it helps, I can try simplifying the parser a little more for clarity.
We can get rid of the Chunk class altogether, and break up that parse()
function into two pieces to make its intent a little clearer.


###
import re

def parse(tokens):
    """Parses one thing, given a source of tokens.  That one thing can
    be either a single piece of text, or a bracketed list."""
    if tokens[0] == '{': return parseList(tokens)         ## Case 1
    else: return tokens.pop(0)                            ## Case 2


def parseList(tokens):
    """To parse a bracketed list, continue parsing the rest of the token
    stream until we hit the end of the bracketed list."""
    tokens.pop(0)                  ## Eat the leading bracket.
    collected_pieces = []
    while tokens[0] != '}':
        collected_pieces.append(parse(tokens))
    tokens.pop(0)                  ## Eat the closing bracket.
    return collected_pieces


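def tokenize(text):
    """Splits text into brackets and words, dropping the whitespace."""
    def nonEmpty(thing):
        return len(thing.strip()) > 0
    ## list() keeps this working where filter() returns an iterator,
    ## since parse() needs to index and pop() the tokens.
    return list(filter(nonEmpty, re.split(r'(\{|\}|\s)', text)))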
###
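
To see the pieces working together on text that's broken over several
lines, here's a small sketch.  The three functions are repeated in compact
form only so that the snippet runs on its own, and the sample text is made
up.  I'm also assuming a Python where we want tokenize() to hand back a
real list, so it uses a list comprehension:

```python
import re


def tokenize(text):
    # Split on brackets and whitespace, keeping the brackets as tokens
    # and dropping the pieces that are empty or only whitespace.
    return [piece for piece in re.split(r'(\{|\}|\s)', text) if piece.strip()]


def parse(tokens):
    if tokens[0] == '{':
        return parseList(tokens)
    return tokens.pop(0)


def parseList(tokens):
    tokens.pop(0)                  # eat the leading bracket
    collected_pieces = []
    while tokens[0] != '}':
        collected_pieces.append(parse(tokens))
    tokens.pop(0)                  # eat the closing bracket
    return collected_pieces


# Made-up sample, deliberately broken over several lines.
sample = """{hello {world
this is}
a test}"""

tokens = tokenize(sample)
print(tokens)
# ['{', 'hello', '{', 'world', 'this', 'is', '}', 'a', 'test', '}']

tree = parse(tokens)
print(tree)
# ['hello', ['world', 'this', 'is'], 'a', 'test']

print(tokens)
# []
```

The newlines are gone by the time parse() sees anything, and the token
list is empty afterwards because the parser has eaten all of it.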


Please feel free to ask more questions!  Good luck to you.