Ideas for parsing this text?
Eric Wertman
ewertman at gmail.com
Thu Apr 24 11:42:16 EDT 2008
Thanks to everyone for the help and feedback. It's amazing to me that
I've been dealing with odd log files and other outputs for quite a
while, and never really stumbled onto a parser as a solution.
I got this far, with Paul's help, which manages my current set of files:
from pyparsing import nestedExpr,Word,alphanums,QuotedString
from pprint import pprint
import re
import glob
files = glob.glob('wsout/*')
for file in files :
    text = open(file).read()
    text = re.sub(r'"\[',' [',text) # These 2 lines just drop double quotes
    text = re.sub(r'\]"','] ',text) # that aren't related to a string
    text = re.sub(r'\[\]','None',text) # this drops the empty []
    text = '[ ' + text + ' ]' # Needs an outer layer
    content = (Word(alphanums+"-_./()*=#\\${}| :,;\t\n\r@?&%%") |
               QuotedString('"',multiline=True))
    structure = nestedExpr("[", "]", content).parseString(text)
    pprint(structure[0].asList())
I'm sure there are cooler ways to do some of that. I spent most of my
time expanding the characters that constitute content. I'm concerned
that over time I'll have things break as other characters show up.
Specifically, a few of the nodes are of German locale, so I could get
some odd international characters.
It looks like pyparsing has a constant for printable characters
(printables). I'm not sure if I can just use that without worrying
about it?
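
As far as I can tell, printables covers only ASCII, so on its own it would not pick up accented characters. One way around the ever-growing character list might be to define content by what it cannot contain (brackets, double quotes, whitespace) rather than by what it may contain. Here is a minimal sketch of that idea using only the standard re module. The sample value and the two pattern names are made up for illustration, and the whitelist is only an approximation of the one above. pyparsing's CharsNotIn class looks like the analogous building block, though that is untested against these files.

```python
import re

# Whitelist approach: only the characters listed count as part of a symbol.
# (Approximation of the character list used with Word() above.)
WHITELIST = re.compile(r'[A-Za-z0-9\-_./()*=#\\${}|:,;@?&%]+')

# Exclusion approach: anything that is not a bracket, a double quote or
# whitespace counts, so accented letters need no special handling.
BY_EXCLUSION = re.compile(r'[^\[\]"\s]+')

# Hypothetical value from a German-locale node ("Groesse" with umlaut and eszett).
sample = 'Gr\u00f6\u00dfe'

whitelist_match = WHITELIST.match(sample).group()
exclusion_match = BY_EXCLUSION.match(sample).group()

# The whitelist stops at the first character it does not know about;
# the exclusion pattern takes the whole word.
print(repr(whitelist_match), repr(exclusion_match))
```

The trade-off is that the exclusion form will also accept characters that are genuinely unexpected, so malformed input is less likely to be flagged.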
At any rate, thumbs up on the parser! Definitely going to add to my toolbox.
On Thu, Apr 24, 2008 at 8:19 AM, Mark Wooding <mdw at distorted.org.uk> wrote:
>
> Eric Wertman <ewertman at gmail.com> wrote:
>
> > I have a set of files with this kind of content (it's dumped from
> > WebSphere):
> >
> > [propertySet "[[resourceProperties "[[[description "This is a required
> > property. This is an actual database name, and its not the locally
> > catalogued database name. The Universal JDBC Driver does not rely on
> > information catalogued in the DB2 database directory."]
> > [name databaseName]
> > [required true]
> > [type java.lang.String]
> > [value DB2Foo]] ...
>
> Looks to me like S-expressions with square brackets instead of the
> normal round ones. I'll bet that the correct lexical analysis is
> approximately
>
>   [                      open-list
>   propertySet            symbol
>   "                      open-string
>   [                      open-list
>   [                      open-list
>   resourceProperties     symbol
>   "                      open-string (not close-string!)
>   ...
>
> so it also looks as if strings aren't properly escaped.
>
> This is definitely not a pretty syntax. I'd suggest an initial
> tokenization pass for the lexical syntax
>
>   [                      open-list
>   ]                      close-list
>   "[                     open-qlist
>   ]"                     close-qlist
>   "..."                  string
>   whitespace             ignore
>   anything-else          symbol
>
> Correct nesting should give you two kinds of lists -- which I've shown
> as `list' and `qlist' (for quoted-list), though given the nastiness of
> the dump you showed, there's no guarantee of correctness.
>
> Turn the input string (or file) into a list (generator?) of lexical
> objects above; then scan that recursively. The lists (or qlists) seem
> to have two basic forms:
>
>   * properties, that is a list of the form [SYMBOL VALUE ...] which can
>     be thought of as a declaration that some property, named by the
>     SYMBOL, has a particular VALUE (or maybe VALUEs); and
>
>   * property lists, which are just lists of properties.
>
> Property lists can be usefully turned into Python dictionaries, indexed
> by their SYMBOLs, assuming that they don't try to declare the same
> property twice.
>
> There are, alas, other kinds of lists too -- one of the property lists
> contains a property `[value []]' which simply contains an empty list.
>
> The right first-cut rule for disambiguation is probably that a property
> list is a non-empty list, all of whose items look like properties, and a
> property is an entry in a property list, and (initially at least)
> restrict properties to the simple form [SYMBOL VALUE] rather than
> allowing multiple values.
>
> Does any of this help?
>
> (In fact, this syntax looks so much like a demented kind of S-expression
> that I'd probably try to parse it, initially at least, by using a Common
> Lisp system's reader and a custom readtable, but that may not be useful
> to you.)
>
> -- [mdw]
>
>
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>
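
For reference, the outline quoted above (a tokenization pass, then a scan that recognizes properties and property lists) could be sketched roughly as follows. This is a minimal sketch using only the standard library, not a tested solution for the real dumps. The names TOKEN_RE, tokenize, read, convert and SAMPLE are made up for illustration. SAMPLE is a shortened, hand-balanced version of the quoted dump. Strings are assumed to contain no double quotes. The scan uses an explicit stack rather than recursion, and lists and qlists are treated alike.

```python
import re

# Lexical syntax from the quoted outline. Order matters: the two-character
# tokens "[ and ]" must be tried before [ , ] and plain strings.
TOKEN_RE = re.compile(r'''
    (?P<open_qlist>"\[)
  | (?P<close_qlist>\]")
  | (?P<open_list>\[)
  | (?P<close_list>\])
  | (?P<string>"[^"]*")
  | (?P<space>\s+)
  | (?P<symbol>[^\[\]"\s]+)
''', re.VERBOSE)


def tokenize(text):
    """Yield (kind, value) pairs, skipping whitespace."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError('cannot tokenize at offset %d' % pos)
        pos = m.end()
        kind = m.lastgroup
        if kind == 'space':
            continue
        value = m.group()
        if kind == 'string':
            value = value[1:-1]          # drop the surrounding quotes
        yield kind, value


def read(tokens):
    """Build nested Python lists; lists and qlists are treated alike."""
    stack = [[]]
    for kind, value in tokens:
        if kind in ('open_list', 'open_qlist'):
            stack.append([])
        elif kind in ('close_list', 'close_qlist'):
            if len(stack) == 1:
                raise ValueError('unbalanced closing bracket')
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(value)      # symbol or string
    if len(stack) != 1:
        raise ValueError('unclosed list')
    return stack[0]


def is_property(item):
    # Simple form only: [SYMBOL VALUE]
    return (isinstance(item, list) and len(item) == 2
            and isinstance(item[0], str))


def is_property_list(item):
    # A non-empty list, all of whose items look like properties.
    return (isinstance(item, list) and len(item) > 0
            and all(is_property(p) for p in item))


def convert(item):
    """Turn property lists into dicts; leave other lists as lists."""
    if is_property_list(item):
        return {name: convert(value) for name, value in item}
    if isinstance(item, list):
        return [convert(x) for x in item]
    return item


# Shortened, hand-balanced stand-in for the real dump (an assumption).
SAMPLE = ('[propertySet "[[resourceProperties "[[[description '
          '"The database name."] [name databaseName] [required true] '
          '[type java.lang.String] [value DB2Foo]]]"]]"]')

result = convert(read(tokenize(SAMPLE)))
print(result)
```

A dict silently keeps only the last value if the same property is declared twice, so real input may need a check for duplicates before converting.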