[XML-SIG] dumping an XML parser skeleton from DTD input

Ken MacLeod ken@bitsko.slc.ut.us
10 Mar 2001 12:55:02 -0600


Eugene.Leitl@lrz.uni-muenchen.de writes:

> "Thomas B. Passin" wrote:
> 
> > You are mixing up several concepts or processing steps.
> 
> I realize that. It comes from being a newbie with a deadline
> breathing down my neck.
>  
> > 1) Parsing  xml.

> > This means to get hold of the structural elements of the xml
> > document and give them to another application for further
> > processing.  There are many xml parsers out there, some command
> > line and some not.  It's almost certainly not worth it to roll
> > your own.
> 
> I know that, but apparently not my senior cow-orkers. It's a C/C++
> shop with an occasional sprinkling of Java, my choice of Python is
> purely personal (note to myself: not to goof up this one).
>  
> Before I try selling them on the DOM thing, I'd rather know what I'm
> doing. It cost them three days to whip up their object tree XML
> parser in Java.
> 
> > 2) Creating a tree-like structure to represent the structure of
> > the xml document.  The DOM is an API for a tree-like
> > representation.  Most major parsers out there either include a DOM
> > api or can work with another DOM API.  (SAX is a non-DOM api, but
> > the output of a sax processor can be used to build a tree, too).
> > The DOM is an object oriented api.
> 
> They (said cow-orkers) insist on an object tree based approach.

Note that DOM objects are a raw, in-memory version of the XML document
(objects representing XML elements, attributes, text nodes).  What you
(or your coworkers) are probably wanting are normal application
objects exported and imported via XML.
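
To make the distinction concrete, here is a small sketch using
xml.dom.minidom from the standard library.  The <point> document is
made up for illustration.  What comes back is generic element and text
nodes, and turning them into an application object is still your job.

```python
from xml.dom import minidom

# A made-up document, only for illustration.
doc = minidom.parseString("<point><x>1</x><y>2</y></point>")
root = doc.documentElement

# The DOM gives back generic nodes (elements, text), not a Point object.
x_element = root.getElementsByTagName("x")[0]
x_text = x_element.firstChild.data  # still a string; converting it is up to you

print(root.tagName, type(x_element).__name__, repr(x_text))
```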

The way your coworkers seem to have started is to create a unique
XML format for each application object or file, and then write
per-file importers and exporters for each format.

As you suspected, there is probably a way to refactor this code so
that you need only have one importer and exporter regardless of which
application objects or file format is used.  Your first post suggested
having some kind of "DTD compiler" that could digest a DTD and produce
a per-file "parser" for you, for reading in arbitrary XML.

Practically speaking, that's a hard problem.  The difficulty is that
each XML format is created "by hand", unique and tweaked to each
application object, yet you're expecting some kind of compiler to
generalize the XML and re-create usable application objects from the
various uniquely designed formats.

So what's the easy way?  Instead of creating a unique format by hand
for each application object, create a set of generic encoding rules
for converting any type of object into XML, and then write a parser to
read the generic XML and convert it into objects.

SOAP is one such set of encoding rules (SOAP Section 5, to be exact),
and if you're comfortable with using the SOAP libraries to read and
write XML, I would highly recommend going that way.  The problem is
that most SOAP libraries are a little tedious to use for "just
serializing objects" (thinking of Apache Java SOAP here in
particular).

To roll your own, you just need a set of simple rules for encoding.
Here's an example XML:

  <top>
    <field1>A simple value in a record, structure, or object</field1>
    <field2 isArray="1">
      <item>A simple value in a list</item>
      <item>
        <subfield1>A simple value, in a structure, in a list</subfield1>
        <subfield2>12345</subfield2>
      </item>
    </field2>
    <field3>
      <subfieldA>A simple value, in a structure, in a structure</subfieldA>
      <subfieldB>12345</subfieldB>
    </field3>
  </top>

The rules are:

  1) If an XML element contains subelements, then the value is an
     array or a structure.

  2) The sub-element names of structures (objects) are the field, key,
     or member names of the structure or object.

  3) An array is indicated by an attribute isArray="1".

  4) The sub-element names of an array are arbitrary, so you can pick
     something like <item>.

  5) If an element has no sub-elements, then that element is a simple
     value (a string, integer, date, whatever).

I didn't put this in the example, but it's easiest to store type
information for every element, whether it be a class name on a
structure or list, or a simple value type (string, integer, date) on a
simple value.  Use an attribute like type="someType".
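
For the writing side, here is a minimal sketch of an encoder that
follows the rules above.  It assumes structures are Python dicts with
keys that are valid XML element names, and arrays are lists or tuples.
The function name encode() is my own, and the type names "integer",
"float" and "string" are just one possible choice.

```python
from xml.sax.saxutils import escape

def encode(name, value, indent=0):
    """Return XML text for `value` in an element called `name`."""
    pad = "  " * indent
    if isinstance(value, dict):
        # Rule 2: sub-element names are the field names of the structure.
        body = "".join(encode(k, v, indent + 1) for k, v in value.items())
        return "%s<%s>\n%s%s</%s>\n" % (pad, name, body, pad, name)
    if isinstance(value, (list, tuple)):
        # Rules 3 and 4: mark the array, use <item> for every member.
        body = "".join(encode("item", v, indent + 1) for v in value)
        return '%s<%s isArray="1">\n%s%s</%s>\n' % (pad, name, body, pad, name)
    # Rule 5: a simple value, tagged with a type attribute.
    if type(value) is int:
        type_name = "integer"
    elif type(value) is float:
        type_name = "float"
    else:
        type_name = "string"
    text = escape(str(value))
    return '%s<%s type="%s">%s</%s>\n' % (pad, name, type_name, text, name)

print(encode("top", {"field1": "a value", "field2": [1, {"sub": 2.5}]}))
```

One thing to watch: an empty dict comes out as an element with no
sub-elements, which rule 5 would read back as a simple value, so you
may want a type attribute on structures as well.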

Here's the relevant part of a decoder for this format, converted by
hand from the Orchard SOAP parser[1]; it should give you a start.
Note that it doesn't try to decode the class names of objects.  When
you want to do that, add the code to the endElement handler, in the
'else' clause of the 'if utype is _CHAR' test.

import xml.sax

# just constants
_DICT = "dict"
_ARRAY = "array"
_CHAR = "char"

class Unpickler:
    def __init__(self, file):
        self.file = file

    def load(self):
        self.parse_value_stack = [ {} ]
        self.parse_utype_stack = [ _DICT ]
        self.parse_type_stack = [ ]

        parser = xml.sax.make_parser()
        parser.setContentHandler(self)
        parser.setErrorHandler(self)
        parser.parse(self.file)
        object = self.parse_value_stack[0]
        delattr(self, 'parse_value_stack')
        return object

    def startElement(self, name, atts):
        self.chars = ""

        type = None
        if 'type' in atts:
            type = atts['type']
        self.parse_type_stack.append(type)

        if 'isArray' in atts:
            self.parse_utype_stack.append(_ARRAY)
            self.parse_value_stack.append( [ ] )
        else:
            # will be set to _DICT if a sub-element is found
            self.parse_utype_stack.append(_CHAR)

    def endElement(self, name):
        type = self.parse_type_stack.pop()
        utype = self.parse_utype_stack.pop()

        if utype is _CHAR:
            if type == 'integer':
                value = int(self.chars)
            elif type == 'float':
                value = float(self.chars)
            else:
                value = self.chars
        else:
            value = self.parse_value_stack.pop()

        # if we're in an element, and our parent element was defaulted
        # to _CHAR, then we're in a struct and we need to create that
        # dictionary.
        if self.parse_utype_stack[-1] is _CHAR:
            self.parse_value_stack.append( {} )
            self.parse_utype_stack[-1] = _DICT

        if self.parse_utype_stack[-1] is _DICT:
            self.parse_value_stack[-1][name] = value
        else:
            self.parse_value_stack[-1].append(value)

    def characters(self, chars):
        # xml.sax passes the character data as a plain string
        self.chars = self.chars + chars

    # remaining SAX callbacks that this decoder does not need
    def setDocumentLocator(self, locator): pass
    def startDocument(self): pass
    def endDocument(self): pass
    def ignorableWhitespace(self, whitespace): pass
    def processingInstruction(self, target, data): pass
    def error(self, exc): raise exc
    def fatalError(self, exc): raise exc
    def warning(self, exc): pass
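
If you do want class names decoded, one way to fill in that 'else'
clause is a lookup table from type="..." values to classes.  This is
only a sketch: CLASS_REGISTRY, register(), instantiate() and the Point
class are all hypothetical names, not part of the Orchard code.

```python
# Hypothetical registry mapping type="..." attribute values to classes.
CLASS_REGISTRY = {}

def register(type_name):
    """Class decorator that files a class under the given type name."""
    def decorator(cls):
        CLASS_REGISTRY[type_name] = cls
        return cls
    return decorator

def instantiate(type_name, value):
    """Build an application object from a decoded dict or list."""
    factory = CLASS_REGISTRY.get(type_name)
    if factory is None:
        # Unknown or missing type: keep the plain dict or list.
        return value
    return factory(value)

@register("Point")
class Point:
    def __init__(self, fields):
        self.x = fields["x"]
        self.y = fields["y"]

p = instantiate("Point", {"x": 1, "y": 2})
print(type(p).__name__, p.x, p.y)
```

In the decoder, the 'else' clause would then read something like
value = instantiate(type, self.parse_value_stack.pop()).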

In C++ or Java, you might consider giving each class you expect to
import from or export to XML a constructor that accepts a dictionary
from the XML reader (to create the new object just read from XML) and
a method asDictionary() that returns the representation of the
object as a dictionary (to be written to XML).
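
The same pattern, sketched in Python for brevity.  The Customer and
Address classes and their fields are invented for the example, and
as_dictionary() is just the Python spelling of asDictionary().

```python
class Address:
    def __init__(self, fields):
        # `fields` is the dictionary handed over by the XML reader.
        self.street = fields["street"]
        self.city = fields["city"]

    def as_dictionary(self):
        return {"street": self.street, "city": self.city}

class Customer:
    def __init__(self, fields):
        self.name = fields["name"]
        # Nested structures arrive as plain dicts; rebuild them here.
        self.addresses = [Address(a) for a in fields.get("addresses", [])]

    def as_dictionary(self):
        return {
            "name": self.name,
            "addresses": [a.as_dictionary() for a in self.addresses],
        }

data = {"name": "Ann", "addresses": [{"street": "1 Main St", "city": "Provo"}]}
customer = Customer(data)
print(customer.as_dictionary() == data)
```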

  -- Ken

[1] <http://casbah.org/~kmacleod/orchard/SOAP.py>