[Tutor] Parsing problem

Wed Jul 20 14:05:20 CEST 2005

Well, I've been poking around, and this is way better than writing 
complex regexes.

To suit my needs, I need something that can handle:

foo = bar
foo = 20
foo = { bar 20 }
foo = { bar = 20 baz}
foo = {bar = 20 baz { dave henry}}

OK, so the last one's extreme. So far I can handle everything up to 
foo = { bar 20 }, but it looks ugly, so some feedback on my very rough 
usage of pyparsing would be great.

>>> from pyparsing import Word, Suppress, ZeroOrMore, alphas, nums
>>> q = (Word(alphas) + Suppress("=") +
...      ((Word(nums) | Word(alphas)) |
...       (Suppress("{") + ZeroOrMore(Word(alphas) | Word(nums)) + Suppress("}"))))
>>> q.parseString("foo = bar").asList()
['foo', 'bar']
>>> q.parseString("a = 23").asList()
['a', '23']
>>> q.parseString(" foo = { bar baz 23 }").asList()
['foo', 'bar', 'baz', '23']

Yeech. 

I'm sure I can shorten that a whole lot (I just found alphanums in the 
manual, d'oh), but it works pretty well out of the box. Thanks for the 
heads up.
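
For what it's worth, here is a rough sketch of how the same grammar might be 
tidied up by naming the pieces. The names atom, braced and assignment are just 
my own labels, not anything from pyparsing, and this only covers the three 
cases tried above.

```python
from pyparsing import Word, Suppress, ZeroOrMore, alphas, nums

# A bare value: a run of letters or a run of digits.
atom = Word(alphas) | Word(nums)

# Zero or more atoms wrapped in braces; the braces are dropped from the result.
braced = Suppress("{") + ZeroOrMore(atom) + Suppress("}")

# name = atom   or   name = { atom atom ... }
assignment = Word(alphas) + Suppress("=") + (atom | braced)

print(assignment.parseString("foo = bar").asList())
print(assignment.parseString("a = 23").asList())
print(assignment.parseString(" foo = { bar baz 23 }").asList())
```

The results come back as flat lists, so the braces leave no trace in the 
output; wrapping the braced part in Group would keep the nesting visible.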

Couple of queries -

I think I understand Danny's example of circular references. 

------
Value << (Symbol | Sequence)
Sequence << (pyparsing.Suppress("{") +
pyparsing.Group(pyparsing.ZeroOrMore(Value)) +
pyparsing.Suppress("}"))
------

Sequence depends on Value for its *ahem* value, but Value depends on 
Sequence for its value, so I'll play with that.
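
As a first stab at applying that idea to my own cases, something like the 
following might work. This is only a sketch under my own assumptions: the names 
atom, value, pair and block are made up for illustration, and I'm guessing that 
a braced block may freely mix bare values and key = value pairs.

```python
from pyparsing import Forward, Group, Suppress, Word, ZeroOrMore, alphas, nums

# A bare value: a run of letters or a run of digits.
atom = Word(alphas) | Word(nums)

# Forward-declare value, because a block contains values and a value may be a block.
value = Forward()

# key = value, grouped so that each key stays attached to its value.
pair = Group(Word(alphas) + Suppress("=") + value)

# A braced block holding any mix of pairs and bare values.
# pair is tried first; if no "=" follows the word, pyparsing falls back to value.
block = Suppress("{") + Group(ZeroOrMore(pair | value)) + Suppress("}")

# Now tie the knot.
value << (atom | block)

for text in ["foo = bar",
             "foo = { bar 20 }",
             "foo = { bar = 20 baz}",
             "foo = {bar = 20 baz { dave henry}}"]:
    print(pair.parseString(text).asList())
```

The last line should come out as 
[['foo', [['bar', '20'], 'baz', ['dave', 'henry']]]], so the nesting in the 
input is mirrored by nesting in the result.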

Is anyone able to post an example of returning dictionaries from 
ParseResults? If so, it would be brilliant.

The documentation states - 
"the Dict class generates dictionary entries using the data of the input 
text - in addition to ParseResults listed as [ [ a1, b1, c1, ...], [ a2, b2, 
c2, ...] ] it also acts as a dictionary with entries defined as { a1 : [ b1, 
c1, ... ] }, { a2 : [ b2, c2, ... ] };"

Problem is, I haven't figured out how to use it yet. I know I could use 
pyparsing.Group(stuff) to ensure proper key:value pairings.
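
My best guess so far, going by that description, is that Dict wants to wrap a 
repetition of Groups, with the first token of each Group becoming the key. A 
minimal sketch follows; the names key, val, entry and config and the sample 
input are my own, and I haven't tried it on the real file.

```python
from pyparsing import Dict, Group, Suppress, Word, ZeroOrMore, alphas, nums

key = Word(alphas)
val = Word(alphas) | Word(nums)

# Each entry is grouped so that Dict sees [key, value] sublists.
entry = Group(key + Suppress("=") + val)

# Dict uses the first token of every group as a dictionary key.
config = Dict(ZeroOrMore(entry))

result = config.parseString("tag = ENG war = 60 ferocity = no")

print(result.asList())   # [['tag', 'ENG'], ['war', '60'], ['ferocity', 'no']]
print(result["war"])     # 60 (as a string)
print(result.asDict())   # plain dict of the same pairs
```

The result still behaves like the usual list of groups, but it can also be 
indexed by key, which matches what the documentation describes.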

Thanks for the pointers so far. I'm feeling very chuffed with myself for 
managing to get this far. I had strayed into VBA territory, and it's nice 
to work with real objects again.

And of course, always open to being shown the simple, elegant way. ;)

Many thanks, 

Liam Clarke

On 7/19/05, Liam Clarke <cyresse at gmail.com> wrote:
> 
> Thanks guys, I daresay I will have a lot of questions regarding this, but 
> at least I have a point to start digging and a better shovel!
> 
> Cheers, 
> 
> Liam Clarke
> 
> On 7/19/05, Danny Yoo <dyoo at hkn.eecs.berkeley.edu> wrote:
> > 
> > 
> > 
> > On Mon, 18 Jul 2005, Liam Clarke wrote:
> > 
> > > country = {
> > > tag = ENG
> > > ai = {
> > > flags = { }
> > > combat = { DAU FRA ORL PRO }
> > > continent = { }
> > > area = { }
> > > region = { "British Isles" "NorthSeaSea" "ECAtlanticSea" "NAtlanticSea"
> > > "TagoSea" "WCAtlanticSea" }
> > > war = 60
> > > ferocity = no
> > > }
> > > }
> > 
> > [Long message ahead; skip if you're not interested.]
> > 
> > 
> > Kent mentioned PyParsing,
> > 
> > http://pyparsing.sourceforge.net/
> > 
> > which is a really excellent system. Here's a demo of what it can do, just
> > so you have a better idea what pyparsing is capable of.
> > 
> > (For the purposes of this demo, I'm doing 'import pyparsing', but in real
> > usage, I'd probably use 'from pyparsing import ...' just to make things
> > less verbose.)
> > 
> > 
> > Let's say that we want to recognize a simpler subset of the data that you
> > have there, something like:
> > 
> > { fee fie foo fum }
> > 
> > And let's imagine that we have a function parse() that can take a string
> > like:
> > 
> > ######
> > >>> testString = """
> > ... { fee fie foo fum } 
> > ... """
> > ######
> > 
> > 
> > This imaginary parse() function could turn that into something that looks
> > like a Python value, like this:
> > 
> > ######
> > >>> parse(testString)
> > (["fee", "fie", "foo", "fum"]) 
> > ######
> > 
> > That's our goal; does this make sense so far? So how do we start?
> > 
> > 
> > 
> > Instead of going at the big goal of doing:
> > 
> > country = { fee fie foo fum }
> > 
> > let's start small by teaching our system how to recognize the innermost 
> > parts, the small things like fee or foo. Let's start there:
> > 
> > ######
> > >>> Symbol = pyparsing.Word(pyparsing.alphas)
> > ######
> > 
> > We want a Symbol to be able to recognize a "Word" made up of alphabetic 
> > letters. Does this work?
> > 
> > ######
> > >>> Symbol.parseString("fee")
> > (['fee'], {})
> > ######
> > 
> > Symbol is now a thing that can parse a string, and return a list of
> > results in a pyparsing.ParseResults object.
> > 
> > 
> > Ok, if we can recognize Symbols, let's go for the jugular:
> > 
> > { fee fie foo fum }
> > 
> > 
> > Let's call this a Sequence.
> > 
> > ######
> > >>> Sequence = "{" + pyparsing.ZeroOrMore (Symbol) + "}"
> > ######
> > 
> > 
> > A Sequence is made up of zero or more Symbols.
> > 
> > 
> > Wait, let's change that, for a moment, to "A Sequence is made up of zero
> > or more Values." (You'll see why in a moment. *grin*) 
> > 
> > 
> > 
> > If we turn toward this strange way, then we need a definition for a Value:
> > 
> > ######
> > >>> Value = Symbol
> > ######
> > 
> > and now we can say that a Sequence is a bunch of Values:
> > 
> > ###### 
> > >>> Sequence = "{" + pyparsing.ZeroOrMore(Value) + "}"
> > ######
> > 
> > 
> > Let's try this out:
> > 
> > ######
> > >>> Sequence.parseString('{ fee fie foo fum}')
> > (['{', 'fee', 'fie', 'foo', 'fum', '}'], {}) 
> > ######
> > 
> > 
> > This is close, but it's not quite right: the problem is that we'd like to
> > somehow group the results all together in a list, and without the braces.
> > That is, we actually want to see:
> > 
> > [['fee', 'fie', 'foo', 'fum']] 
> > 
> > in some form. (Remember, we want a list of a single result, and that
> > result should be our Sequence.)
> > 
> > 
> > How do we get this working? We have to tell pyparsing to "Group" the
> > middle elements together in a collection, and to "suppress" the braces 
> > from the result.
> > 
> > Here we go:
> > 
> > ######
> > >>> Sequence = (pyparsing.Suppress("{") +
> > ... pyparsing.Group(pyparsing.ZeroOrMore(Value)) +
> > ... pyparsing.Suppress ("}"))
> > ######
> > 
> > Does this work?
> > 
> > 
> > ######
> > >>> Sequence.parseString('{ fee fie foo fum}')
> > ([(['fee', 'fie', 'foo', 'fum'], {})], {})
> > ######
> > 
> > 
> > That looks a little messy and more nested than expected. 
> > 
> > 
> > Actually, what's happening is that we're looking at that
> > pyparsing.ParseResults object, so there's more nesting in the string
> > representation than what's really there. We can use the ParseResults's
> > asList() method to make it a little easier to see what the real result 
> > value looks like:
> > 
> > ######
> > >>> Sequence.parseString('{ fee fie foo fum}').asList()
> > [['fee', 'fie', 'foo', 'fum']]
> > ######
> > 
> > That's better.
> > 
> > 
> > 
> > Out of curiosity, wouldn't it be neat if we could parse out something 
> > like 
> > this?
> > 
> > { fee fie {foo fum} }
> > 
> > *cough* *cough*
> > 
> > What we'd like to do is make Sequence itself a possible value. The
> > problem is that then there's a little circularity involved:
> > 
> > 
> > ### Illegal PyParsing pseudocode ###
> > Value = Symbol | Sequence
> > 
> > Sequence = (pyparsing.Suppress("{") +
> > pyparsing.Group(pyparsing.ZeroOrMore(Value)) +
> > pyparsing.Suppress ("}"))
> > ######
> > 
> > The problem is that Value can't be defined before Sequence is, and
> > vice-versa. We break this problem by telling PyParsing "ok, the following
> > rules will come up soon" and "forward" define them:
> > 
> > ######
> > >>> Value = pyparsing.Forward()
> > >>> Sequence = pyparsing.Forward()
> > ######
> > 
> > and once we have these forward declarations, we can then reconnect them to
> > their real definitions by using '<<'. (This looks bizarre, but it applies
> > just to rules that are Forward()ed.)
> > 
> > ######
> > Value << (Symbol | Sequence)
> > Sequence << (pyparsing.Suppress("{") +
> > pyparsing.Group(pyparsing.ZeroOrMore(Value)) +
> > pyparsing.Suppress("}"))
> > ######
> > 
> > 
> > Let's try it:
> > 
> > ######
> > >>> Value.parseString(' { fee fie {foo fum} } ').asList()
> > [['fee', 'fie', ['foo', 'fum']]]
> > ######
> > 
> > 
> > Cool.
> > 
> > 
> > Ok, that was a little artificial, but oh well. The idea is we now know
> > how to say:
> > 
> > A Value is either a Symbol or Sequence
> > 
> > and
> > 
> > A Sequence is a bunch of Values
> > 
> > without getting into trouble with pyparsing, and that's important whenever
> > we're dealing with things that have recursive structure... like:
> > 
> > country = {
> > tag = ENG
> > ai = {
> > flags = { }
> > combat = { DAU FRA ORL PRO }
> > continent = { }
> > area = { }
> > region = { "British Isles"
> > "NorthSeaSea"
> > "ECAtlanticSea"
> > "NAtlanticSea"
> > "TagoSea"
> > "WCAtlanticSea" }
> > war = 60
> > ferocity = no }
> > }
> > 
> > Anyway, this is a really fast whirlwind tour of pyparsing, with some
> > intentional glossing-over of hard stuff, just so you get a better idea of
> > the core of parsing. Sorry if it went fast. *grin*
> > 
> > 
> > If you have questions, please feel free to ask!
> > 
> > 
> 
> 
> -- 
> 'There is only one basic human right, and that is to do as you damn well 
> please.
> And with it comes the only basic human duty, to take the consequences.' 
