[Tutor] Parsing problem
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Tue Jul 19 07:19:26 CEST 2005
On Mon, 18 Jul 2005, Liam Clarke wrote:
> country = {
> tag = ENG
> ai = {
> flags = { }
> combat = { DAU FRA ORL PRO }
> continent = { }
> area = { }
> region = { "British Isles" "NorthSeaSea" "ECAtlanticSea" "NAtlanticSea"
> "TagoSea" "WCAtlanticSea" }
> war = 60
> ferocity = no
> }
> }
[Long message ahead; skip if you're not interested.]
Kent mentioned PyParsing,
http://pyparsing.sourceforge.net/
which is a really excellent system. Here's a demo of what it can do, just
so you have a better idea what pyparsing is capable of.
(For the purposes of this demo, I'm doing 'import pyparsing', but in real
usage, I'd probably use 'from pyparsing import ...' just to make things
less verbose.)
Let's say that we want to recognize a simpler subset of the data that you
have there, something like:
{ fee fie foo fum }
And let's imagine that we have a function parse() that can take a string
like:
######
>>> testString = """
... { fee fie foo fum }
... """
######
This imaginary parse() function could turn that into something that looks
like a Python value, like this:
######
>>> parse(testString)
(["fee", "fie", "foo", "fum"])
######
That's our goal; does this make sense so far? So how do we start?
Instead of going at the big goal of doing:
country = { fee fie foo fum }
let's start small by teaching our system how to recognize the innermost
parts, the small things like fee or foo. Let's start there:
######
>>> Symbol = pyparsing.Word(pyparsing.alphas)
######
We want a Symbol to be able to recognize a "Word" made up of alphabetic
letters. Does this work?
######
>>> Symbol.parseString("fee")
(['fee'], {})
#######
Symbol is now a thing that can parse a string, and return a list of
results in a pyparsing.ParseResults object.
Ok, if we can recognize Symbols, let's go for the jugular:
{ fee fie foo fum }
Let's call this a Sequence.
######
>>> Sequence = "{" + pyparsing.ZeroOrMore(Symbol) + "}"
######
A Sequence is made up of zero or more Symbols.
Wait, let's change that, for a moment, to "A Sequence is made up of zero
or more Values." (You'll see why in a moment. *grin*)
If we turn toward this strange way, then we need a definition for a Value:
######
>>> Value = Symbol
######
and now we can say that a Sequence is a bunch of Values:
######
>>> Sequence = "{" + pyparsing.ZeroOrMore(Value) + "}"
######
Let's try this out:
######
>>> Sequence.parseString('{ fee fie foo fum}')
(['{', 'fee', 'fie', 'foo', 'fum', '}'], {})
######
This is close, but it's not quite right: the problem is that we'd like to
somehow group the results all together in a list, and without the braces.
That is, we actually want to see:
[['fee', 'fie', 'foo', 'fum']]
in some form. (Remember, we want a list of a single result, and that
result should be our Sequence.)
How do we get this working? We have to tell pyparsing to "Group" the
middle elements together in a collection, and to "suppress" the braces
from the result.
Here we go:
######
>>> Sequence = (pyparsing.Suppress("{") +
... pyparsing.Group(pyparsing.ZeroOrMore(Value)) +
... pyparsing.Suppress("}"))
######
Does this work?
######
>>> Sequence.parseString('{ fee fie foo fum}')
([(['fee', 'fie', 'foo', 'fum'], {})], {})
######
That looks a little messy and more nested than expected.
Actually, what's happening is that we're looking at that
pyparsing.ParseResults object, so there's more nesting in the string
representation than what's really there. We can use the ParseResults's
asList() method to make it a little easier to see what the real result
value looks like:
######
>>> Sequence.parseString('{ fee fie foo fum}').asList()
[['fee', 'fie', 'foo', 'fum']]
######
That's better.
Out of curiosity, wouldn't it be neat if we could parse out something like
this?
{ fee fie {foo "fum"} }
*cough* *cough*
What we'd like to do is make Sequence itself a possible value. The
problem is that then there's a little circularity involved:
### Illegal PyParsing pseudocode ###
Value = Symbol | Sequence
Sequence = (pyparsing.Suppress("{") +
pyparsing.Group(pyparsing.ZeroOrMore(Value)) +
pyparsing.Suppress("}"))
######
The problem is that Value can't be defined before Sequence is, and
vice-versa. We break this problem by telling PyParsing "ok, the following
rules will come up soon" and "forward" define them:
######
>>> Value = pyparsing.Forward()
>>> Sequence = pyparsing.Forward()
######
and once we have these forward declarations, we can then reconnect them to
their real definitions by using '<<'. (This looks bizarre, but it applies
just to rules that are Forward()ed.)
######
Value << (Symbol | Sequence)
Sequence << (pyparsing.Suppress("{") +
pyparsing.Group(pyparsing.ZeroOrMore(Value)) +
pyparsing.Suppress("}"))
######
Let's try it:
######
>>> Value.parseString(' { fee fie {foo fum} } ').asList()
[['fee', 'fie', ['foo', 'fum']]]
######
Cool.
Ok, that was a little artificial, but oh well. The idea is we now know
how to say:
A Value is either a Symbol or Sequence
and
A Sequence is a bunch of Values
without getting into trouble with pyparsing, and that's important whenever
we're dealing with things that have recursive structure... like:
country = {
tag = ENG
ai = {
flags = { }
combat = { DAU FRA ORL PRO }
continent = { }
area = { }
region = { "British Isles"
"NorthSeaSea"
"ECAtlanticSea"
"NAtlanticSea"
"TagoSea"
"WCAtlanticSea" }
war = 60
ferocity = no }
}
Anyway, this is a really fast whirlwind tour of pyparsing, with some
intentional glossing-over of hard stuff, just so you get a better idea of
the core of parsing. Sorry if it went fast. *grin*
If you have questions, please feel free to ask!
More information about the Tutor
mailing list