Ideas for parsing this text?

Mark Wooding mdw at distorted.org.uk
Thu Apr 24 08:19:46 EDT 2008


Eric Wertman <ewertman at gmail.com> wrote:

> I have a set of files with this kind of content (it's dumped from
> WebSphere):
>
> [propertySet "[[resourceProperties "[[[description "This is a required
> property. This is an actual database name, and its not the locally
> catalogued database name. The Universal JDBC Driver does not rely on
> information catalogued in the DB2 database directory."]
> [name databaseName]
> [required true]
> [type java.lang.String]
> [value DB2Foo]] ...>

Looks to me like S-expressions with square brackets instead of the
normal round ones.  I'll bet that the correct lexical analysis is
approximately

  [			open-list
  propertySet		symbol
  "			open-string
  [			open-list
  [			open-list
  resourceProperties	symbol
  "			open-string	(not close-string!)
  ...

so it also looks as if strings aren't properly escaped.

This is definitely not a pretty syntax.  I'd suggest an initial
tokenization pass for the lexical syntax

  [			open-list
  ]			close-list
  "[			open-qlist
  ]"			close-qlist
  "..."			string
  whitespace		ignore
  anything-else		symbol

Correct nesting should give you two kinds of lists -- which I've shown
as `list' and `qlist' (for quoted-list), though given the nastiness of
the dump you showed, there's no guarantee of correctness.
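
For what it's worth, here is a minimal Python sketch of that tokenization
pass.  The token names, the `tokenize' function and the regular expression
are my own choices for illustration.  It assumes that a plain string never
contains a double quote and never starts with `[', neither of which the dump
guarantees.

```python
import re

# One alternative per lexical class.  Order matters: the two-character
# qlist delimiters must be tried before plain strings and plain brackets.
TOKEN_RE = re.compile(r'''
    (?P<open_qlist>"\[)
  | (?P<close_qlist>\]")
  | (?P<string>"[^"]*")
  | (?P<open_list>\[)
  | (?P<close_list>\])
  | (?P<ws>\s+)
  | (?P<symbol>[^\s\[\]"]+)
''', re.VERBOSE)

def tokenize(text):
    """Yield (kind, value) pairs; whitespace is dropped."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            # Typically a stray double quote with no partner.
            raise ValueError("cannot tokenize at offset %d" % pos)
        pos = m.end()
        kind = m.lastgroup
        if kind == "ws":
            continue
        value = m.group()
        if kind == "string":
            value = value[1:-1]     # strip the surrounding quotes
        yield kind, value
```

Because it is a generator, you can wrap it in list() to inspect the tokens
or hand it directly to whatever does the recursive scan.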

Turn the input string (or file) into a list (or generator) of the
lexical objects above, then scan that recursively.  The lists (or qlists)
seem to have two basic forms:

  * properties, that is, lists of the form [SYMBOL VALUE ...], each of
    which can be thought of as a declaration that some property, named
    by the SYMBOL, has a particular VALUE (or maybe VALUEs); and

  * property lists, which are just lists of properties.

Property lists can be usefully turned into Python dictionaries, indexed
by their SYMBOLs, assuming that they don't try to declare the same
property twice.
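
As a small sketch of that conversion, assuming a property list has already
been parsed into a Python list of two-element [SYMBOL, VALUE] lists (the
function name `plist_to_dict' is made up for illustration):

```python
def plist_to_dict(plist):
    """Turn [[name, value], ...] into {name: value, ...}.

    Raises ValueError if the same property is declared twice, rather
    than silently keeping only the last declaration.
    """
    result = {}
    for name, value in plist:
        if name in result:
            raise ValueError("property %r declared twice" % (name,))
        result[name] = value
    return result
```

So plist_to_dict([["name", "databaseName"], ["required", "true"]]) gives
a dictionary with the keys "name" and "required".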

There are, alas, other kinds of lists too -- one of the property lists
contains a property `[value []]' which simply contains an empty list.

The right first-cut rule for disambiguation is probably this: a property
list is a non-empty list, all of whose items look like properties, and a
property is an entry in a property list.  Initially at least, restrict
properties to the simple form [SYMBOL VALUE] rather than allowing
multiple values.
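
A rough Python sketch of the recursive scan together with that first-cut
rule follows.  It assumes the tokens arrive as (kind, value) pairs with the
kinds "open_list", "close_list", "open_qlist", "close_qlist", "string" and
"symbol".  Those names, the `Symbol' class and the functions are all
invented here for illustration, and lists and qlists are treated alike once
their delimiters have been checked.

```python
class Symbol(str):
    """A bare word, as opposed to a quoted string."""

# Which closing token ends which kind of list.
CLOSERS = {"open_list": "close_list", "open_qlist": "close_qlist"}

def parse(tokens):
    """Parse the first complete item from an iterable of tokens."""
    tokens = iter(tokens)
    return _parse_item(tokens, next(tokens))

def _parse_item(tokens, token):
    kind, value = token
    if kind == "symbol":
        return Symbol(value)
    if kind == "string":
        return value
    if kind in CLOSERS:
        items = []
        # The recursive calls consume tokens from the same iterator.
        for token in tokens:
            if token[0] == CLOSERS[kind]:
                return items
            items.append(_parse_item(tokens, token))
        raise ValueError("unterminated list")
    raise ValueError("unexpected token %r" % (token,))

def is_property(item):
    """First-cut rule: a property is exactly [SYMBOL VALUE]."""
    return (isinstance(item, list) and len(item) == 2
            and isinstance(item[0], Symbol))

def interpret(item):
    """Turn property lists into dicts; leave other lists as lists."""
    if not isinstance(item, list):
        return item
    if item and all(is_property(sub) for sub in item):
        # Note: a repeated property name silently overwrites here.
        return {str(name): interpret(value) for name, value in item}
    return [interpret(sub) for sub in item]
```

With this rule the empty list in `[value []]' stays an empty list rather
than becoming an empty dictionary, since a property list must be non-empty.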

Does any of this help?

(In fact, this syntax looks so much like a demented kind of S-expression
that I'd probably try to parse it, initially at least, by using a Common
Lisp system's reader and a custom readtable, but that may not be useful
to you.)

-- [mdw]


