[Tutor] parsing--is this right?
Danny Yoo
dyoo@hkn.eecs.berkeley.edu
Mon, 10 Jun 2002 15:02:37 -0700 (PDT)
On Mon, 10 Jun 2002, Paul Tremblay wrote:
> I have just stumbled across the concept of parsing and parsing
> grammars, and wondered if I am using the right tool.
>
> I have downloaded and installed Plex in order to parse an RTF
> document. The RTF looks like this:
>
> {\footnote {\i an italicized word} {\i maybe another italicized
> word} text }
Hi Paul,
Hmmm... if the structure is always described by braces, it might be
easier to do a "recursive descent" parse through the message. Here's a
sample parser that shows how the technique works.
###
import re

class Chunk:
    """We'll break up our data into Chunk pieces."""
    def __init__(self, type, data):
        self.type, self.data = type, data

    def __str__(self):
        if type(self.data) == type([]):
            stringed_data = [str(d) for d in self.data]
            return "TYPE %s: [%s]" % (self.type, ''.join(stringed_data))
        else:
            return str(self.data)

def parse(tokens):
    assert tokens, "Why are you trying to parse nothing?"
    if tokens[0] == '{':
        tokens.pop(0)                        ## Eat the leading bracket.
        type = tokens.pop(0)
        collected_chunks = []
        while tokens[0] != '}':
            collected_chunks.append(parse(tokens))
        tokens.pop(0)                        ## Eat the closing bracket.
        return Chunk(type, collected_chunks)
    else:
        ## Single token
        new_chunk = Chunk('text', tokens.pop(0))
        return new_chunk

def tokenize(text):
    "A toy tokenizer just for demonstration purposes."
    def nonEmpty(thing): return len(thing) > 0
    return filter(nonEmpty, re.split(r'([{}]|\s+)', text))

if __name__ == '__main__':
    EXAMPLE_TEXT = r"""{\footnote {\i an italicized word} {\i
maybe another italicized word} text }""".replace('\n', ' ')
    print parse(tokenize(EXAMPLE_TEXT))
###
The brackets make it really easy to detect the boundaries where the parser
should clump things together. I used a pathetically silly tokenize()
function to do some initial preprocessing of the text; Plex would probably
do a better job at it.
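Just to make the token stream concrete, here's roughly what a tokenizer
along those lines hands back (a self-contained sketch; note that, unlike
the nonEmpty filter above, this one also drops the whitespace-only tokens,
purely to make the stream easier to read):

```python
import re

def tokenize(text):
    """Split on braces and whitespace, keeping braces as their own tokens."""
    # The capturing group in re.split() keeps the brace delimiters in the
    # result; t.strip() filters out both '' and whitespace-only pieces.
    return [t for t in re.split(r'([{}]|\s+)', text) if t.strip()]

tokens = tokenize(r'{\i an italicized word}')
# tokens == ['{', '\\i', 'an', 'italicized', 'word', '}']
```

Once the text is flattened into a list like that, the parser only ever has
to look at the first token to decide what to do next.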
> In order to parse this text, I use a counter to count the number
> of open and closed curly brackets. I am following the tutorial in
> doing this.
>
> However, I am wondering if plex does things the wrong way.
Plex is a tool for breaking the text into "tokens" that are easier to look
at. However, it's not enough. You'll probably want to use a parsing
technique like recursive descent, or use a specialized parser-building
tool. We can talk about it more if you'd like.
> If I understand things correctly, you should not have to count brackets.
> A parser should use the grammar to understand what state you are in.
Yes, exactly. The grammar that the example parser above is reading might
be described as:
;;;
Chunk := text                  # A "Chunk" is made of a piece of text,
       | '{' Type Chunk* '}'   # or a bunch of Chunks between two
                               # brackets, associated with a particular
                               # chunk type.
;;;
With a parser, there's no need to count how deeply we're nested, as
grammars allow for a recursive definition that can go arbitrarily deep.
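To see that point in isolation -- this isn't the parser above, just the
same recursive idea boiled down -- here's a sketch that gathers brace
groups into nested lists, with the call stack doing all of the depth
bookkeeping for us:

```python
def parse_group(tokens):
    """Recursively collect tokens into nested lists, one list per {...} group.

    There's no depth counter anywhere: each '{' simply triggers another
    recursive call, so the nesting can go arbitrarily deep.
    """
    result = []
    while tokens and tokens[0] != '}':
        if tokens[0] == '{':
            tokens.pop(0)                      # eat the '{'
            result.append(parse_group(tokens))
            tokens.pop(0)                      # eat the matching '}'
        else:
            result.append(tokens.pop(0))
    return result

nested = parse_group(list('{a{b{c}}}'))
# nested == [['a', ['b', ['c']]]]
```

Three levels of braces in, three levels of lists out, and we never had to
keep count ourselves.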
> However, I am wondering if it lacks the power a parser should have, and
> if I should devote my time to a better tool.
I've had some experience with the SPARK parser, and it's a nice tool:
http://pages.cpsc.ucalgary.ca/~aycock/spark/
I used it a while back when I was playing with Pythonica, and it seemed to
work very well. The only caveat I had was that, back then, if there was
an error in the grammar, SPARK wouldn't give good error messages. I'm
not sure if that complaint still applies: perhaps this has been fixed by
now.
Another parser that I've heard good things about is Amit Patel's YAPPS
parser:
http://theory.stanford.edu/~amitp/Yapps/
Anyway, I hope this helps!