trouble pyparsing

Wed Jan 4 21:43:55 EST 2006

"the.theorist" <the.theorist at gmail.com> wrote in message
news:1136422425.209587.100500 at z14g2000cwz.googlegroups.com...
> Hey, I'm trying my hand and pyparsing a log file (named l.log):
> FIRSTLINE
>
> PROPERTY1      DATA1
> PROPERTY2      DATA2
>
> PROPERTYS LIST
>         ID1     data1
>         ID2     data2
>
>         ID1     data11
>         ID2     data12
>
> SECTION
>
> So I wrote up a small bit of code (named p.py):
> from pyparsing import *
> import sys
>
> toplevel = Forward()
>
> firstLine = Word('FIRSTLINE')
> property  = (Word('PROPERTY1') + Word(alphanums)) ^ (Word('PROPERTY2')
> + Word(alphanums))
>
> id        = (Word('ID1') + Word(alphanums)) ^ (Word('ID2') +
> Word(alphanums))
> plist     = Word('PROPERTYS LIST') + ZeroOrMore( id )
>
> toplevel << firstLine
> toplevel << OneOrMore( property )
> toplevel << plist
>
> par = toplevel
>
> print toplevel.parseFile(sys.argv[1])
>
> The problem is that I get the following error:
<snip>
> Is this a fundamental error, or is it just me? (I haven't yet tried
> simpleparse)
>

It's you.

Well, let's focus on the behavior and not the individual.  There are two
major misconceptions that you have here:
1. Confusing "Word" for "Literal"
2. Confusing "<<" Forward assignment for some sort of C++ streaming
operator.

What puzzles me is that in some places, you correctly use the Word class, as
in Word(alphanums), to indicate a "word" as a contiguous set of characters
found in the string alphanums.  You also correctly use '+' to build up id
and plist expressions, but then you use "<<" successively in what looks like
streaming into the toplevel variable.

When your grammar includes Word("FIRSTLINE"), you are actually saying you
want to match a "word" composed of one or more letters found in the string
"FIRSTLINE" - this would match not only FIRSTLINE, but also FIRST, LINE,
LIRST, FINE, LIST, FIST, FLINTSTRINE, well, you get the idea.  Just the way
Word(alphanums) matches DATA1, DATA2, data1, data2, data11, and data12.

What you really want here is the class Literal, as in Literal("FIRSTLINE").
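
To make the difference concrete, here is a small sketch that runs both
expressions against a few sample strings.  The helper name matches() and the
sample strings are just for illustration; parseAll=True is used so that a
match has to consume the whole string.

```python
from pyparsing import Literal, ParseException, Word

# Word: one or more characters drawn from the set {F, I, R, S, T, L, N, E}
word_expr = Word("FIRSTLINE")
# Literal: exactly the string "FIRSTLINE"
literal_expr = Literal("FIRSTLINE")


def matches(expr, text):
    """Return True if expr matches the whole of text (hypothetical helper)."""
    try:
        expr.parseString(text, parseAll=True)
        return True
    except ParseException:
        return False


for sample in ["FIRSTLINE", "FIRST", "LIST", "FLINTSTRINE"]:
    print(sample, matches(word_expr, sample), matches(literal_expr, sample))
```

The Word version accepts all four samples; the Literal version accepts only
the first.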

As for toplevel, there is no reason here to use Forward() - reserve use of
this class for recursive structures, such as lists composed of lists, etc.
toplevel is simply the sequence of a firstline, OneOrMore properties, and a
plist, which is just the plain old:

toplevel = firstLine + OneOrMore(property) + plist
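
For contrast, here is a minimal sketch of the kind of recursive grammar that
Forward is meant for.  The parenthesized nested-list format and the names
nested and item are made up for illustration and have nothing to do with your
log file.  Note that "<<" appears exactly once, to fill in the body of the
Forward after the expressions that refer to it have been defined.

```python
from pyparsing import Forward, Group, Suppress, Word, ZeroOrMore, alphanums

# A list is "(" + zero or more items + ")", where an item is either a
# plain word or another list.  The definition refers to itself, so a
# Forward placeholder is needed.
nested = Forward()
item = Word(alphanums) | nested
nested << Group(Suppress("(") + ZeroOrMore(item) + Suppress(")"))

result = nested.parseString("(a (b c) d)")
print(result.asList())  # [['a', ['b', 'c'], 'd']]
```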

Lastly, if you'll peruse the documentation that comes with pyparsing, you'll
also find the Group class.  This class is very helpful in imparting some
structure to the returned set of tokens.
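
As a quick illustration of what Group changes, here is a cut-down sketch
using a single ID expression.  The names pair, flat and grouped and the
one-line input string are invented for this example.

```python
from pyparsing import Group, Literal, OneOrMore, Word, alphanums

pair = Literal("ID1") + Word(alphanums)
text = "ID1 data1 ID1 data11"

# Without Group, every token lands in one flat list.
flat = OneOrMore(pair).parseString(text)
# With Group, each matched pair becomes its own sublist.
grouped = OneOrMore(Group(pair)).parseString(text)

print(flat.asList())     # ['ID1', 'data1', 'ID1', 'data11']
print(grouped.asList())  # [['ID1', 'data1'], ['ID1', 'data11']]
```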

Here is a before/after version of your program that gives more successful
results.

-- Paul

data = """FIRSTLINE

PROPERTY1      DATA1
PROPERTY2      DATA2

PROPERTYS LIST
        ID1     data1
        ID2     data2

        ID1     data11
        ID2     data12

SECTION
"""

from pyparsing import *
import sys

#~ toplevel = Forward()

#~ firstLine = Word('FIRSTLINE')
firstLine = Literal('FIRSTLINE')

#~ property  = (Word('PROPERTY1') + Word(alphanums)) ^ (Word('PROPERTY2') +
#~              Word(alphanums))
property  = ((Literal('PROPERTY1') + Word(alphanums)) ^
             (Literal('PROPERTY2') + Word(alphanums)))

#~ id        = (Word('ID1') + Word(alphanums)) ^ (Word('ID2') +
#~              Word(alphanums))
id        = ((Literal('ID1') + Word(alphanums)) ^
             (Literal('ID2') + Word(alphanums)))

#~ plist     = Word('PROPERTYS LIST') + ZeroOrMore( id )
plist     = Literal('PROPERTYS LIST') + ZeroOrMore( id )

#~ toplevel << firstLine
#~ toplevel << OneOrMore( property )
#~ toplevel << plist
toplevel = firstLine + OneOrMore( property ) + plist

par = toplevel

print(par.parseString(data))

# add Groups, to give structure to results, rather than just returning a
# flat list of strings
plist     = Literal('PROPERTYS LIST') + ZeroOrMore( Group(id) )
toplevel = firstLine + Group(OneOrMore(Group(property))) + Group(plist)

par = toplevel

print(par.parseString(data))