[Tutor] parsing--is this right?
Danny Yoo
dyoo@hkn.eecs.berkeley.edu
Mon, 10 Jun 2002 15:02:37 -0700 (PDT)
On Mon, 10 Jun 2002, Paul Tremblay wrote:
> I have just stumbled across the concept of parsing and parsing
> grammars, and wondered if I am using the right tool.
>
> I have downloaded and installed Plex in order to parse an RTF
> document. The RTF looks like this:
>
> {\footnote {\i an italicized word} {\i maybe another italicized
> word} text }
Hi Paul,
Hmmm... if the structure is always described by braces, it might be
easier to do a "recursive descent" parse through the message. Here's a
sample parser that shows how the technique works.
###
import re

class Chunk:
    """We'll break up our data into Chunk pieces."""
    def __init__(self, type, data):
        self.type, self.data = type, data

    def __str__(self):
        if type(self.data) == type([]):
            stringed_data = [str(d) for d in self.data]
            return "TYPE %s: [%s]" % (self.type, ''.join(stringed_data))
        else:
            return str(self.data)

def parse(tokens):
    assert tokens, "Why are you trying to parse nothing?"
    if tokens[0] == '{':
        tokens.pop(0)                        ## Eat the leading bracket.
        type = tokens.pop(0)
        collected_chunks = []
        while tokens[0] != '}':
            collected_chunks.append(parse(tokens))
        tokens.pop(0)                        ## Eat the closing bracket.
        return Chunk(type, collected_chunks)
    else:
        ## Single token
        new_chunk = Chunk('text', tokens.pop(0))
        return new_chunk

def tokenize(text):
    "A toy tokenizer just for demonstration purposes."
    def nonEmpty(thing): return len(thing) > 0
    return filter(nonEmpty, re.split(r'([{}]|\s+)', text))

if __name__ == '__main__':
    EXAMPLE_TEXT = r"""{\footnote {\i an italicized word} {\i
maybe another italicized word} text }""".replace('\n', ' ')
    print parse(tokenize(EXAMPLE_TEXT))
###
The brackets make it really easy to detect the boundaries where the parser
should clump things together. I used a pathetically silly tokenize()
function to do some initial preprocessing of the text; Plex would probably
do a better job at it.
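Just to make the token stream concrete, here's roughly what a tokenizer
along those lines hands back (a self-contained sketch; note that, unlike
the nonEmpty filter above, this one also drops the whitespace-only tokens,
purely to make the stream easier to read):

```python
import re

def tokenize(text):
    """Split on braces and whitespace, keeping braces as their own tokens."""
    # The capturing group in re.split() keeps the brace delimiters in the
    # result; t.strip() filters out both '' and whitespace-only pieces.
    return [t for t in re.split(r'([{}]|\s+)', text) if t.strip()]

tokens = tokenize(r'{\i an italicized word}')
# tokens == ['{', '\\i', 'an', 'italicized', 'word', '}']
```

Once the text is flattened into a list like that, the parser only ever has
to look at the first token to decide what to do next.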
> In order to parse this text, I use a counter to count the number
> of open and closed curly brackets. I am following the tutorial in
> doing this.
>
> However, I am wondering if plex does things the wrong way.
Plex is a tool for breaking the text into "tokens" that are easier to look
at. However, it's not enough. You'll probably want to use a parsing
technique like recursive descent, or use a specialized parser-building
tool. We can talk about it more if you'd like.
> If I understand things correctly, you should not have to count brackets.
> A parser should use the grammar to understand what state you are in.
Yes, exactly. The grammar that the example parser above is reading might
be described as:
;;;
Chunk := text                  # A "Chunk" is made of a piece of text,
       | '{' Type Chunk* '}'   # or a bunch of Chunks between two
                               # brackets, associated with a particular
                               # chunk type.
;;;
With a parser, there's no need to count how deeply we're nested, as
grammars allow for a recursive definition that can go arbitrarily deep.
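To see that point in isolation -- this isn't the parser above, just the
same recursive idea boiled down -- here's a sketch that gathers brace
groups into nested lists, with the call stack doing all of the depth
bookkeeping for us:

```python
def parse_group(tokens):
    """Recursively collect tokens into nested lists, one list per {...} group.

    There's no depth counter anywhere: each '{' simply triggers another
    recursive call, so the nesting can go arbitrarily deep.
    """
    result = []
    while tokens and tokens[0] != '}':
        if tokens[0] == '{':
            tokens.pop(0)                      # eat the '{'
            result.append(parse_group(tokens))
            tokens.pop(0)                      # eat the matching '}'
        else:
            result.append(tokens.pop(0))
    return result

nested = parse_group(list('{a{b{c}}}'))
# nested == [['a', ['b', ['c']]]]
```

Three levels of braces in, three levels of lists out, and we never had to
keep count ourselves.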
> However, I am wondering if it lacks the power a parser should have, and
> if I should devote my time to a better tool.
I've had some experience with the SPARK parser, and it's a nice tool:
http://pages.cpsc.ucalgary.ca/~aycock/spark/
I used it a while back when I was playing with Pythonica, and it seemed to
work very well. The only caveat I had was that, back then, if there was
an error in the grammar, SPARK wouldn't give good error messages. I'm
not sure if that complaint still applies: perhaps this has been fixed by
now.
Another parser that I've heard good things about is Amit Patel's YAPPS
parser:
http://theory.stanford.edu/~amitp/Yapps/
Anyway, I hope this helps!