Ideas for parsing this text?

Eric Wertman ewertman at gmail.com
Thu Apr 24 11:42:16 EDT 2008


Thanks to everyone for the help and feedback.  It's amazing to me that
I've been dealing with odd log files and other outputs for quite a
while, and never really stumbled onto a parser as a solution.


I got this far, with Paul's help, which manages my current set of files:

from pyparsing import nestedExpr,Word,alphanums,QuotedString
from pprint import pprint
import re
import glob

files = glob.glob('wsout/*')

for filename in files:
    with open(filename) as f:
        text = f.read()
    text = re.sub(r'"\[', ' [', text)      # These 2 lines just drop double quotes
    text = re.sub(r'\]"', '] ', text)      # that aren't related to a string
    text = re.sub(r'\[\]', 'None', text)   # this replaces the empty [] with None
    text = '[ ' + text + ' ]'              # Needs an outer layer

    # Parentheses let the expression span two lines without a syntax error
    content = (Word(alphanums + "-_./()*=#\\${}| :,;\t\n\r@?&%")
               | QuotedString('"', multiline=True))
    structure = nestedExpr("[", "]", content).parseString(text)

    pprint(structure[0].asList())

I'm sure there are cooler ways to do some of that.  I spent most of my
time expanding the characters that constitute content.  I'm concerned
that over time I'll have things break as other characters show up.
Specifically, a few of the nodes use a German locale, so I could get
some odd international characters.

It looks like pyparsing has a constant for printable characters
(printables).  I'm not sure if I can just use that, without worrying
about it?
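
A quick check of that idea, using only the standard library.  As far
as I know, pyparsing's printables is just the non-whitespace part of
string.printable, so it is ASCII-only and would not cover umlauts.
The sketch below rebuilds that set (the name ascii_printables is
mine) and compares it with defining a symbol by exclusion: anything
that is not a bracket, a double quote or whitespace.  The regular
expression is my assumption about what counts as content in these
dumps; it is not taken from pyparsing.

```python
import re
import string

# Assumed equivalent of pyparsing's printables: ASCII printables minus whitespace.
ascii_printables = "".join(c for c in string.printable
                           if c not in string.whitespace)

# Hypothetical alternative: define a symbol by what it may NOT contain,
# so characters nobody listed in advance are still accepted.
SYMBOL_RE = re.compile(r'[^\[\]"\s]+')

def covered_by_ascii(word):
    """Return True if every character of word is in the ASCII printable set."""
    return all(c in ascii_printables for c in word)

german_word = "Gr\u00f6\u00dfe"   # a word with an o-umlaut and a sharp s

print(covered_by_ascii("databaseName"))
print(covered_by_ascii(german_word))
print(SYMBOL_RE.findall("[name " + german_word + "] [type java.lang.String]"))
```

If this holds, a whitelist built from printables would still need the
brackets and the double quote removed, and would still miss the
non-ASCII characters.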

At any rate, thumbs up on the parser!  Definitely going to add to my toolbox.


On Thu, Apr 24, 2008 at 8:19 AM, Mark Wooding <mdw at distorted.org.uk> wrote:
>
> Eric Wertman <ewertman at gmail.com> wrote:
>
> > I have a set of files with this kind of content (it's dumped from
> > WebSphere):
> >
> > [propertySet "[[resourceProperties "[[[description "This is a required
> > property. This is an actual database name, and its not the locally
> > catalogued database name. The Universal JDBC Driver does not rely on
> > information catalogued in the DB2 database directory."]
> > [name databaseName]
> > [required true]
> > [type java.lang.String]
> > [value DB2Foo]] ...>
>
> Looks to me like S-expressions with square brackets instead of the
> normal round ones.  I'll bet that the correct lexical analysis is
> approximately
>
>  [                     open-list
>  propertySet           symbol
>  "                     open-string
>  [                     open-list
>  [                     open-list
>  resourceProperties    symbol
>  "                     open-string     (not close-string!)
>  ...
>
> so it also looks as if strings aren't properly escaped.
>
> This is definitely not a pretty syntax.  I'd suggest an initial
> tokenization pass for the lexical syntax
>
>  [                     open-list
>  ]                     close-list
>  "[                    open-qlist
>  ]"                    close-qlist
>  "..."                 string
>  whitespace            ignore
>  anything-else         symbol
>
> Correct nesting should give you two kinds of lists -- which I've shown
> as `list' and `qlist' (for quoted-list), though given the nastiness of
> the dump you showed, there's no guarantee of correctness.
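
A minimal sketch of that tokenization pass, using only the re module.
The group names (open_qlist and so on) and the function name tokenize
are my own choices, and the string rule assumes a string never
contains a double quote, which the dump may not guarantee.

```python
import re

# One alternative per token kind; order matters, so the two-character
# qlist delimiters are tried before the plain brackets and strings.
TOKEN_RE = re.compile(r'''
    (?P<open_qlist>"\[)
  | (?P<close_qlist>\]")
  | (?P<open_list>\[)
  | (?P<close_list>\])
  | (?P<string>"[^"]*")
  | (?P<ws>\s+)
  | (?P<symbol>[^\[\]"\s]+)
''', re.VERBOSE)

def tokenize(text):
    """Yield (kind, text) pairs, skipping whitespace."""
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            # Only an unterminated double quote can end up here.
            raise ValueError("cannot tokenize at offset %d" % pos)
        pos = match.end()
        if match.lastgroup != "ws":
            yield match.lastgroup, match.group()

for token in tokenize('[name "[[a "b c"]]"]'):
    print(token)
```

Note that a close-list directly followed by a string, with no space
between them, would be read as a close-qlist here.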
>
> Turn the input string (or file) into a list (generator?) of lexical
> objects above; then scan that recursively.  The lists (or qlists) seem
> to have two basic forms:
>
>  * properties, that is a list of the form [SYMBOL VALUE ...] which can
>    be thought of as a declaration that some property, named by the
>    SYMBOL, has a particular VALUE (or maybe VALUEs); and
>
>  * property lists, which are just lists of properties.
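
The recursive scan could look roughly like this.  It assumes the
tokens arrive as (kind, text) pairs with the kind names shown in the
code, which are my own labels for the lexical classes above.  Both
list and qlist become plain Python lists, and a string token has its
surrounding quotes stripped.

```python
# Which closing token each opening token must be matched by.
CLOSER = {"open_list": "close_list", "open_qlist": "close_qlist"}

def parse(tokens):
    """Turn a flat sequence of (kind, text) tokens into nested lists."""
    tokens = iter(tokens)   # shared by every level of the recursion

    def parse_items(expected_close):
        items = []
        for kind, text in tokens:
            if kind in CLOSER:
                items.append(parse_items(CLOSER[kind]))
            elif kind in ("close_list", "close_qlist"):
                if kind != expected_close:
                    raise ValueError("unexpected %r" % text)
                return items
            elif kind == "string":
                items.append(text[1:-1])   # drop the double quotes
            else:
                items.append(text)         # a bare symbol
        if expected_close is not None:
            raise ValueError("input ended inside a list")
        return items

    return parse_items(None)

sample = [
    ("open_list", "["), ("symbol", "propertySet"), ("open_qlist", '"['),
    ("open_list", "["), ("symbol", "required"), ("symbol", "true"),
    ("close_list", "]"), ("close_qlist", ']"'), ("close_list", "]"),
]
print(parse(sample))
```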
>
> Property lists can be usefully turned into Python dictionaries, indexed
> by their SYMBOLs, assuming that they don't try to declare the same
> property twice.
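
As a small illustration, assuming a property list has already been
parsed into a Python list of [SYMBOL, VALUE] pairs, the conversion
might be sketched as below.  The function name to_dict is
hypothetical, and the check for a repeated SYMBOL is there because
the dump gives no guarantee about duplicates.

```python
def to_dict(property_list):
    """Build a dict from [SYMBOL, VALUE] pairs, refusing duplicates."""
    result = {}
    for symbol, value in property_list:
        if symbol in result:
            raise ValueError("property %r declared twice" % symbol)
        result[symbol] = value
    return result

props = [
    ["name", "databaseName"],
    ["required", "true"],
    ["type", "java.lang.String"],
    ["value", "DB2Foo"],
]
print(to_dict(props)["type"])
```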
>
> There are, alas, other kinds of lists too -- one of the property lists
> contains a property `[value []]' which simply contains an empty list.
>
> The right first-cut rule for disambiguation is probably that a property
> list is a non-empty list, all of whose items look like properties, and a
> property is an entry in a property list, and (initially at least)
> restrict properties to the simple form [SYMBOL VALUE] rather than
> allowing multiple values.
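
That first-cut rule can be written down fairly directly.  The sketch
assumes the input is already nested Python lists of strings, and the
function names are mine.  A property is a two-item list whose first
item is a string; a property list is a non-empty list made only of
properties; everything else is left as an ordinary list.

```python
def is_property(item):
    """A property is [SYMBOL, VALUE] with a string SYMBOL."""
    return (isinstance(item, list) and len(item) == 2
            and isinstance(item[0], str))

def is_property_list(item):
    """A property list is a non-empty list whose items are all properties."""
    return (isinstance(item, list) and len(item) > 0
            and all(is_property(entry) for entry in item))

def convert(item):
    """Recursively turn property lists into dicts; keep other lists as lists."""
    if is_property_list(item):
        # A repeated SYMBOL would silently overwrite the earlier one here.
        return {symbol: convert(value) for symbol, value in item}
    if isinstance(item, list):
        return [convert(entry) for entry in item]
    return item

tree = [["propertySet",
         [["resourceProperties",
           [[["name", "databaseName"], ["required", "true"]]]]]]]
print(convert(tree))
print(convert([["value", []]]))   # the empty list stays an empty list
```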
>
> Does any of this help?
>
> (In fact, this syntax looks so much like a demented kind of S-expression
> that I'd probably try to parse it, initially at least, by using a Common
> Lisp system's reader and a custom readtable, but that may not be useful
> to you.)
>
> -- [mdw]