Thanks guys, I daresay I will have a lot of questions regarding this,
but at least I have a point to start digging and a better shovel!<br>
<br>
Cheers, <br>
<br>
Liam Clarke<br><br><div><span class="gmail_quote">On 7/19/05, <b class="gmail_sendername">Danny Yoo</b> <<a href="mailto:dyoo@hkn.eecs.berkeley.edu">dyoo@hkn.eecs.berkeley.edu</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<br><br>On Mon, 18 Jul 2005, Liam Clarke wrote:<br><br>> country = {<br>> tag = ENG<br>> ai = {<br>> flags = { }<br>> combat = { DAU FRA ORL PRO }<br>> continent = { }<br>> area = { }<br>> region = { "British Isles" "NorthSeaSea" "ECAtlanticSea" "NAtlanticSea"
<br>> "TagoSea" "WCAtlanticSea" }<br>> war = 60<br>> ferocity = no<br>> }<br>> }<br><br>[Long message ahead; skip if you're not interested.]<br><br><br>Kent mentioned PyParsing,<br><br>
<a href="http://pyparsing.sourceforge.net/">http://pyparsing.sourceforge.net/</a><br><br>which is a really excellent system. Here's a demo of what it can do, just<br>so you have a better idea what pyparsing is capable of.
<br><br>(For the purposes of this demo, I'm doing 'import pyparsing', but in real<br>usage, I'd probably use 'from pyparsing import ...' just to make things<br>less verbose.)<br><br><br>Let's say that we want to recognize a simpler subset of the data that you
<br>have there, something like:<br><br> { fee fie foo fum }<br><br>And let's imagine that we have a function parse() that can take a string<br>like:<br><br>######<br>>>> testString = """<br>... { fee fie foo fum }
<br>... """<br>######<br><br><br>This imaginary parse() function could turn that into something that looks<br>like a Python value, like this:<br><br>######<br>>>> parse(testString)<br>(["fee", "fie", "foo", "fum"])
<br>######<br><br>That's our goal; does this make sense so far? So how do we start?<br><br><br><br>Instead of going at the big goal of doing:<br><br> country = { fee fie foo fum }<br><br>let's start small by teaching our system how to recognize the innermost
<br>parts, the small things like fee or foo. Let's start there:<br><br>######<br>>>> Symbol = pyparsing.Word(pyparsing.alphas)<br>######<br><br>We want a Symbol to be able to recognize a "Word" made up of alphabetic
<br>letters. Does this work?<br><br>######<br>>>> Symbol.parseString("fee")<br>(['fee'], {})<br>######<br><br>Symbol is now a thing that can parse a string and return a list of<br>results in a pyparsing.ParseResults
object.<br><br><br>Ok, if we can recognize Symbols, let's go for the jugular:<br><br> { fee fie foo fum }<br><br><br>Let's call this a Sequence.<br><br>######<br>>>> Sequence = "{" + pyparsing.ZeroOrMore(Symbol) + "}"<br>######<br><br><br>A Sequence is made up of zero or more Symbols.<br><br><br>Wait, let's change that, for a moment, to "A Sequence is made up of zero<br>or more Values." (You'll see why in a moment. *grin*)
<br><br><br><br>If we take this slight detour, then we need a definition for a Value:<br><br>######<br>>>> Value = Symbol<br>######<br><br>and now we can say that a Sequence is a bunch of Values:<br><br>######
<br>>>> Sequence = "{" + pyparsing.ZeroOrMore(Value) + "}"<br>######<br><br><br>Let's try this out:<br><br>######<br>>>> Sequence.parseString('{ fee fie foo fum}')<br>(['{', 'fee', 'fie', 'foo', 'fum', '}'], {})
<br>######<br><br><br>This is close, but it's not quite right: the problem is that we'd like to<br>somehow group the results all together in a list, and without the braces.<br>That is, we actually want to see:<br><br> [['fee', 'fie', 'foo', 'fum']]
<br><br>in some form. (Remember, we want a list of a single result, and that<br>result should be our Sequence.)<br><br><br>How do we get this working? We have to tell pyparsing to "Group" the<br>middle elements together in a collection, and to "suppress" the braces
<br>from the result.<br><br>Here we go:<br><br>######<br>>>> Sequence = (pyparsing.Suppress("{") +<br>...             pyparsing.Group(pyparsing.ZeroOrMore(Value)) +<br>...             pyparsing.Suppress("}"))<br>######<br><br>Does this work?<br><br><br>######<br>>>> Sequence.parseString('{ fee fie foo fum}')<br>([(['fee', 'fie', 'foo', 'fum'], {})], {})<br>######<br><br><br>That looks a little messy and more nested than expected.
<br><br><br>Actually, what we're looking at is the string representation of a<br>pyparsing.ParseResults object, which shows more nesting than is really<br>there. We can use the ParseResults object's asList() method to make it<br>a little easier to see what the real result
<br>value looks like:<br><br>######<br>>>> Sequence.parseString('{ fee fie foo fum}').asList()<br>[['fee', 'fie', 'foo', 'fum']]<br>######<br><br>That's better.<br><br><br><br>Out of curiosity, wouldn't it be neat if we could parse out something like
<br>this?<br><br> { fee fie {foo fum} }<br><br>*cough* *cough*<br><br>What we'd like to do is make Sequence itself a possible Value. The<br>problem is that there's then a little circularity involved:<br><br>
<br>### Illegal pyparsing pseudocode ###<br>Value = Symbol | Sequence<br><br>Sequence = (pyparsing.Suppress("{") +<br> pyparsing.Group(pyparsing.ZeroOrMore(Value)) +<br> pyparsing.Suppress("}"))<br>######<br><br>The problem is that Value can't be defined before Sequence is, and<br>vice versa. We get around this by telling pyparsing "ok, the following<br>rules will come up soon" and "forward"-declaring them:
<br><br>######<br>>>> Value = pyparsing.Forward()<br>>>> Sequence = pyparsing.Forward()<br>######<br><br>and once we have these forward declarations, we can then reconnect them to<br>their real definitions by using '<<'. (This looks bizarre, but it applies
<br>just to rules that are Forward()ed.)<br><br>######<br>Value << (Symbol | Sequence)<br>Sequence << (pyparsing.Suppress("{") +<br> pyparsing.Group(pyparsing.ZeroOrMore(Value)) +<br>
pyparsing.Suppress("}"))<br>######<br><br><br>Let's try it:<br><br>######<br>>>> Value.parseString(' { fee fie {foo fum} } ').asList()<br>[['fee', 'fie', ['foo', 'fum']]]<br>######<br><br><br>Cool.<br><br>
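For anyone who'd rather run this from a file than retype it at the prompt, here is a minimal sketch that gathers the pieces above into one script.<br>It uses the 'from pyparsing import ...' style, and parse() is only a guess at how the imaginary function from the start of this message might be filled in.<br><br>

```python
from pyparsing import Forward, Group, Suppress, Word, ZeroOrMore, alphas

# A Symbol is a run of alphabetic characters, e.g. fee or foo.
Symbol = Word(alphas)

# Forward-declare the two mutually recursive rules.
Value = Forward()
Sequence = Forward()

# Attach the real definitions with "<<", as in the walkthrough.
# The parentheses matter, because "<<" binds more tightly than "|".
Value << (Symbol | Sequence)
Sequence << (Suppress("{") + Group(ZeroOrMore(Value)) + Suppress("}"))


def parse(text):
    """Parse a brace-delimited sequence and return plain nested lists.

    This helper is an illustration, not part of pyparsing itself.
    """
    return Value.parseString(text).asList()


print(parse(" { fee fie {foo fum} } "))
```

<br>Running it should print the same nested list as the interactive session above.<br><br>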
<br>Ok, that was a little artificial, but oh well. The idea is we now know<br>how to say:<br><br> A Value is either a Symbol or Sequence<br><br>and<br><br> A Sequence is a bunch of Values<br><br>without getting into trouble with pyparsing, and that's important whenever
<br>we're dealing with things that have recursive structure... like:<br><br> country = {<br> tag = ENG<br> ai = {<br>
flags = { }<br>
combat = { DAU FRA ORL PRO }<br>
continent = { }<br>
area = { }<br>
region = { "British Isles"<br> "NorthSeaSea"<br> "ECAtlanticSea"<br> "NAtlanticSea"<br>
"TagoSea"<br> "WCAtlanticSea"
}<br>
war = 60<br>
ferocity = no<br> }<br> }<br><br>Anyway, this is a really fast whirlwind tour of pyparsing, with some<br>intentional glossing-over of hard stuff, just so you get a better idea of<br>the core of parsing. Sorry if it went fast. *grin*
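<br><br>As a rough sketch of where this could go next (it is not something worked through step by step above), the same Forward trick can probably be stretched to cover the "name = value" lines, the quoted strings, and the numbers in the country block.<br>The rule names Number, String, Assignment, and Block, and the helper parse_country(), are made up for this illustration. QuotedString and setParseAction are regular pyparsing pieces that weren't introduced above.<br><br>

```python
from pyparsing import (Forward, Group, QuotedString, Suppress, Word,
                       ZeroOrMore, alphas, nums)

# Leaf values: bare words (ENG, no), integers (60), and quoted strings.
Symbol = Word(alphas)
Number = Word(nums).setParseAction(lambda tokens: int(tokens[0]))
String = QuotedString('"')

# A Value can itself be a block, so it has to be forward-declared.
Value = Forward()

# name = value, grouped as a [name, value] pair.
Assignment = Group(Symbol + Suppress("=") + Value)

# { ... } holding any mix of assignments and bare values, braces dropped.
# Assignment is tried first; a bare Symbol falls through to Value.
Block = Suppress("{") + Group(ZeroOrMore(Assignment | Value)) + Suppress("}")

Value << (Number | String | Symbol | Block)


def parse_country(text):
    """Hypothetical helper: parse one top-level 'name = value' entry."""
    return Assignment.parseString(text, parseAll=True).asList()


sample = """
country = {
    tag = ENG
    ai = {
        flags = { }
        combat = { DAU FRA ORL PRO }
        region = { "British Isles" "NorthSeaSea" }
        war = 60
        ferocity = no
    }
}
"""

print(parse_country(sample))
```

<br>Each "name = value" line comes back as a two-item list, and each brace block as a nested list, so the result can be walked with ordinary loops. Keys here are assumed to be purely alphabetic; real files may need a more generous Symbol.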
<br><br><br>If you have questions, please feel free to ask!<br><br></blockquote></div><br><br><br>-- <br>'There is only one basic human right, and that is to do as you damn well please.<br>And with it comes the only basic human duty, to take the consequences.'