a few more questions on XML and python

Alex Martelli aleax at aleax.it
Fri Jan 4 08:30:18 EST 2002


"Rajarshi Guha" <rxg218 at psu.edu> wrote in message
news:a12o76$18va at r02n01.cac.psu.edu...
    ...
> I'm a little confused as to how expat is supposed to handle an arbitrary
> XML file where the tags could be describing anything.

See http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/65248
for a specific example of using expat.  Most often, however, you
would instead use the more flexible SAX interface, as in the example
at http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/65127.

An excellent general introduction to XML and Python is at
http://www.oreilly.com/catalog/pythonxml/chapter/ch01.html (you'll
probably want to buy the whole book after reading this superb
first chapter).

Note that, be it with expat or SAX, the parsing is "event driven".
When an opening tag is encountered, what follows it is not known
yet.  So, what you normally do is:

    in the start-handler for the tag of your interest, you save
    its attributes (all that's known so far) and prepare containers
    for embedded tags and text that may be needed;

    when tags or characters are received after you've seen the
    open-tag of interest but before the close-tag, you save the
    relevant information in the containers;

    at close-handler time, you process the saved information for
    the tag.
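
To see the event order for yourself, here is a minimal sketch that
uses expat directly.  The function name trace_events and the tuple
layout are just my choices for illustration, not part of any API; the
only library pieces used are ParserCreate, the three handler
attributes, and Parse.

```python
import xml.parsers.expat

def trace_events(xml_text):
    """Return the events expat reports, in the order it reports them."""
    events = []

    def start(name, attrs):
        # Only the tag name and its attributes are known at this point.
        events.append(('start', name, dict(attrs)))

    def chars(data):
        # Text may arrive in several pieces, so real handlers accumulate it.
        events.append(('chars', data))

    def end(name):
        # Everything between the open-tag and close-tag has been seen by now.
        events.append(('end', name))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)   # True: this is the final chunk of input
    return events

if __name__ == '__main__':
    for event in trace_events('<coordinate name="a">23 45</coordinate>'):
        print(event)
```

The start event always comes before any of the element's text, and the
end event comes last, which is why the processing proper has to wait
for the close-handler.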

Let's take a simple example.  Say that an XML-marked-up text,
whatever else it may contain, has a tag called 'coordinate',
with a mandatory attribute 'name'; between <coordinate name="x">
and the corresponding </coordinate>, only character data may
be present (no need to worry about other embedded tags, except
maybe to diagnose an error and terminate processing).

Given such an XML file, you want to print only the coordinate
data to standard output, one line per coordinate, in the form:

name -> coordinate data

OK so far?

Here's one possible approach, then:

import xml.sax

class handler(xml.sax.handler.ContentHandler):
    def startDocument(self):
        self.current_data = None
        self.current_name = None
    def endDocument(self):
        assert self.current_data is None
        assert self.current_name is None
    def startElement(self, name, attr):
        assert self.current_data is None
        assert self.current_name is None
        if name=='coordinate':
            self.current_data = []
            self.current_name = attr.get('name')
    def endElement(self, name):
        if name=='coordinate':
            assert self.current_data is not None
            assert self.current_name is not None
            print("%s -> %s" % (self.current_name,
                ''.join(self.current_data)))
            self.current_data = None
            self.current_name = None
    def characters(self, content):
        if self.current_data is not None:
            self.current_data.append(content)
    def ignorableWhitespace(self, ws):
        if self.current_data is not None:
            self.current_data.append(ws)

# and some tiny self-testing...:
if __name__=='__main__':
    x = '''<?xml version="1.0" encoding="ISO8859-1"?>
    <blobof>
    One <coordinate name="a">23 45</coordinate> two <plik/>
    three <coordinate name="b">42 68</coordinate> four and
    <some>other</some> tag.
    </blobof>
    '''
    flob = open('someinput.xml', 'w')
    flob.write(x)
    flob.close()

    xml.sax.parse('someinput.xml', handler())


Normally, you want to process several different tags, and
testing for each case in startElement and endElement is
neither elegant nor productive.  Instead, you can dispatch on
tag name in each of these methods, in several possible ways --
Python makes it easy to do so via introspection, and that's a
common tack to take (in imitation of sgmllib, for example).
E.g.,

    def startElement(self, name, attr):
        try: method = getattr(self, 'start_'+name)
        except AttributeError: pass
        else: method(attr)
    def endElement(self, name):
        try: method = getattr(self, 'end_'+name)
        except AttributeError: pass
        else: method()

and then you'd code the blocks that in the above example are
guarded by the "if name=='coordinate':" clauses into methods
    def start_coordinate(self, attr):
and
    def end_coordinate(self):
(endElement receives no attributes, so the end_ methods take
none either).

But this doesn't deeply change the nature of what's going on.
It's still basically a game of preparing object state in the
start-tag methods, enriching it in the characters method, and
processing the accumulated stuff in the end-tag methods.
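
Putting the pieces together, a dispatching handler for the coordinate
example might look roughly like this.  It is a sketch, not the one
true way: the class name DispatchingHandler, the results list, and the
helper function coordinates are names I made up for illustration, and
I collect the pairs in a list rather than printing them so that the
result is easy to check.

```python
import xml.sax
import xml.sax.handler

class DispatchingHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.current_data = None
        self.current_name = None
        self.results = []          # (name, text) pairs, in document order

    # Generic dispatch: look for start_<tag> / end_<tag>, ignore other tags.
    def startElement(self, name, attr):
        method = getattr(self, 'start_' + name, None)
        if method is not None:
            method(attr)

    def endElement(self, name):
        method = getattr(self, 'end_' + name, None)
        if method is not None:
            method()

    def characters(self, content):
        # Accumulate text only while inside a tag of interest.
        if self.current_data is not None:
            self.current_data.append(content)

    # Per-tag methods: prepare state at the start, process it at the end.
    def start_coordinate(self, attr):
        self.current_data = []
        self.current_name = attr.get('name')

    def end_coordinate(self):
        self.results.append((self.current_name, ''.join(self.current_data)))
        self.current_data = None
        self.current_name = None

def coordinates(xml_bytes):
    """Parse XML given as bytes and return the collected (name, text) pairs."""
    handler = DispatchingHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.results

if __name__ == '__main__':
    sample = (b'<blobof>One <coordinate name="a">23 45</coordinate> two '
              b'<plik/> three <coordinate name="b">42 68</coordinate></blobof>')
    for name, data in coordinates(sample):
        print("%s -> %s" % (name, data))
```

Supporting another tag then means adding a start_ and/or end_ method
for it, with no change to startElement or endElement.  Note that tag
names which are not valid Python identifiers (containing '-' or ':',
say) will simply never match a method with this scheme.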


Alex





