[Tutor] Parsing problem

Liam Clarke cyresse at gmail.com
Tue Jul 19 08:23:20 CEST 2005


Thanks guys, I daresay I will have a lot of questions regarding this, but at 
least I have a point to start digging and a better shovel!

Cheers, 

Liam Clarke

On 7/19/05, Danny Yoo <dyoo at hkn.eecs.berkeley.edu> wrote:
> 
> 
> 
> On Mon, 18 Jul 2005, Liam Clarke wrote:
> 
> > country = {
> > tag = ENG
> > ai = {
> > flags = { }
> > combat = { DAU FRA ORL PRO }
> > continent = { }
> > area = { }
> > region = { "British Isles" "NorthSeaSea" "ECAtlanticSea" "NAtlanticSea"
> > "TagoSea" "WCAtlanticSea" }
> > war = 60
> > ferocity = no
> > }
> > }
> 
> [Long message ahead; skip if you're not interested.]
> 
> 
> Kent mentioned PyParsing,
> 
> http://pyparsing.sourceforge.net/
> 
> which is a really excellent system. Here's a demo of what it can do, just
> so you have a better idea what pyparsing is capable of.
> 
> (For the purposes of this demo, I'm doing 'import pyparsing', but in real
> usage, I'd probably use 'from pyparsing import ...' just to make things
> less verbose.)
> 
> 
> Let's say that we want to recognize a simpler subset of the data that you
> have there, something like:
> 
> { fee fie foo fum }
> 
> And let's imagine that we have a function parse() that can take a string
> like:
> 
> ######
> >>> testString = """
> ... { fee fie foo fum }
> ... """
> ######
> 
> 
> This imaginary parse() function could turn that into something that looks
> like a Python value, like this:
> 
> ######
> >>> parse(testString)
> (["fee", "fie", "foo", "fum"])
> ######
> 
> That's our goal; does this make sense so far? So how do we start?
> 
> 
> 
> Instead of going at the big goal of doing:
> 
> country = { fee fie foo fum }
> 
> let's start small by teaching our system how to recognize the innermost
> parts, the small things like fee or foo. Let's start there:
> 
> ######
> >>> Symbol = pyparsing.Word(pyparsing.alphas)
> ######
> 
> We want a Symbol to be able to recognize a "Word" made up of alphabetic
> letters. Does this work?
> 
> ######
> >>> Symbol.parseString("fee")
> (['fee'], {})
> #######
> 
> Symbol is now a thing that can parse a string, and return a list of
> results in a pyparsing.ParseResults object.
> 
> 
> Ok, if we can recognize Symbols, let's go for the jugular:
> 
> { fee fie foo fum }
> 
> 
> Let's call this a Sequence.
> 
> ######
> >>> Sequence = "{" + pyparsing.ZeroOrMore(Symbol) + "}"
> ######
> 
> 
> A Sequence is made up of zero or more Symbols.
> 
> 
> Wait, let's change that, for a moment, to "A Sequence is made up of zero
> or more Values." (You'll see why in a moment. *grin*)
> 
> 
> 
> If we turn toward this strange way, then we need a definition for a Value:
> 
> ######
> >>> Value = Symbol
> ######
> 
> and now we can say that a Sequence is a bunch of Values:
> 
> ######
> >>> Sequence = "{" + pyparsing.ZeroOrMore(Value) + "}"
> ######
> 
> 
> Let's try this out:
> 
> ######
> >>> Sequence.parseString('{ fee fie foo fum}')
> (['{', 'fee', 'fie', 'foo', 'fum', '}'], {})
> ######
> 
> 
> This is close, but it's not quite right: the problem is that we'd like to
> somehow group the results all together in a list, and without the braces.
> That is, we actually want to see:
> 
> [['fee', 'fie', 'foo', 'fum']]
> 
> in some form. (Remember, we want a list of a single result, and that
> result should be our Sequence.)
> 
> 
> How do we get this working? We have to tell pyparsing to "Group" the
> middle elements together in a collection, and to "suppress" the braces
> from the result.
> 
> Here we go:
> 
> ######
> >>> Sequence = (pyparsing.Suppress("{") +
> ... pyparsing.Group(pyparsing.ZeroOrMore(Value)) +
> ... pyparsing.Suppress("}"))
> ######
> 
> Does this work?
> 
> 
> ######
> >>> Sequence.parseString('{ fee fie foo fum}')
> ([(['fee', 'fie', 'foo', 'fum'], {})], {})
> ######
> 
> 
> That looks a little messy and more nested than expected.
> 
> 
> Actually, what's happening is that we're looking at that
> pyparsing.ParseResults object, so there's more nesting in the string
> representation than what's really there. We can use the ParseResults's
> asList() method to make it a little easier to see what the real result
> value looks like:
> 
> ######
> >>> Sequence.parseString('{ fee fie foo fum}').asList()
> [['fee', 'fie', 'foo', 'fum']]
> ######
> 
> That's better.
> 
> 
> 
> Out of curiosity, wouldn't it be neat if we could parse out something like
> this?
> 
> { fee fie {foo "fum"} }
> 
> *cough* *cough*
> 
> What we'd like to do is make Sequence itself a possible value. The
> problem is that then there's a little circularity involved:
> 
> 
> ### Illegal PyParsing pseudocode ###
> Value = Symbol | Sequence
> 
> Sequence = (pyparsing.Suppress("{") +
> pyparsing.Group(pyparsing.ZeroOrMore(Value)) +
> pyparsing.Suppress("}"))
> ######
> 
> The problem is that Value can't be defined before Sequence is, and
> vice-versa. We break this problem by telling PyParsing "ok, the following
> rules will come up soon" and "forward" define them:
> 
> ######
> >>> Value = pyparsing.Forward()
> >>> Sequence = pyparsing.Forward()
> ######
> 
> and once we have these forward declarations, we can then reconnect them to
> their real definitions by using '<<'. (This looks bizarre, but it applies
> just to rules that are Forward()ed.)
> 
> ######
> Value << (Symbol | Sequence)
> Sequence << (pyparsing.Suppress("{") +
> pyparsing.Group(pyparsing.ZeroOrMore(Value)) +
> pyparsing.Suppress("}"))
> ######
> 
> 
> Let's try it:
> 
> ######
> >>> Value.parseString(' { fee fie {foo fum} } ').asList()
> [['fee', 'fie', ['foo', 'fum']]]
> ######
> 
> 
> Cool.
> 
> 
> Ok, that was a little artificial, but oh well. The idea is we now know
> how to say:
> 
> A Value is either a Symbol or Sequence
> 
> and
> 
> A Sequence is a bunch of Values
> 
> without getting into trouble with pyparsing, and that's important whenever
> we're dealing with things that have recursive structure... like:
> 
> country = {
> tag = ENG
> ai = {
> flags = { }
> combat = { DAU FRA ORL PRO }
> continent = { }
> area = { }
> region = { "British Isles"
> "NorthSeaSea"
> "ECAtlanticSea"
> "NAtlanticSea"
> "TagoSea"
> "WCAtlanticSea" }
> war = 60
> ferocity = no }
> }
> 
> Anyway, this is a really fast whirlwind tour of pyparsing, with some
> intentional glossing-over of hard stuff, just so you get a better idea of
> the core of parsing. Sorry if it went fast. *grin*
> 
> 
> If you have questions, please feel free to ask!
> 
> 


-- 
'There is only one basic human right, and that is to do as you damn well 
please.
And with it comes the only basic human duty, to take the consequences.'
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/tutor/attachments/20050719/0a4c6b6f/attachment-0001.htm


More information about the Tutor mailing list