sax

Alex Martelli aleaxit at yahoo.com
Fri Oct 27 18:36:19 EDT 2000


"Hwanjo Yu" <hwanjoyu at uiuc.edu> wrote in message
news:MGlK5.684$pr6.11046 at vixen.cso.uiuc.edu...
> Hi,
>
> It is hard to understand sax module without an example.
> Could someone get me an example of how to use sax to parse a xml file

Let's take a typically-toy example.  Say we often need to filter
XML files extracting only the text content that is within a given
named tag.  A Python sax approach to that would be, e.g.:


import xml.sax as sax, xml.sax.handler as h

class handle(h.ContentHandler, h.DTDHandler, h.EntityResolver):
    def __init__(self, trigger):
        self.trigger=trigger
        self.pieces=[]
        self.inTrigger=0
        self.currpiece=[]
    def startElement(self, name, attrs):
        if name==self.trigger:
            self.inTrigger+=1
    def characters(self, content):
        if self.inTrigger:
            self.currpiece.append(content)
    def endElement(self, name):
        if name==self.trigger:
            self.inTrigger-=1
            if not self.inTrigger:
                self.pieces.append(''.join(self.currpiece))
                self.currpiece=[]
    def endDocument(self):
        print("found %d pieces" % len(self.pieces))
        for piece in self.pieces:
            print("     " + piece)

I've defined the methods in the order they would typically
be called.  The instance keeps several pieces of state:
a "trigger", the tag name of interest; a list of "pieces" of
text found within such tags; the current "piece" being
collected, if any; and a counter, inTrigger, that remembers
how many instances of the trigger are currently open.
__init__ is pretty obvious.  startElement is called at each
element start, and increments the inTrigger count if appropriate.

characters is called when textual data is met (the text
can be variously split up, depending on the parser, so
it's normally accumulated until we know we have all of
a 'piece of interest').  endElement is called as each
element ends, and decrements the inTrigger count if
appropriate; when inTrigger is decremented back to
0, the current-piece is joined into one string which
gets appended to the pieces list, then it's set to empty
again.
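
To see the splitting for yourself, a small sketch along these
lines may help.  The names ChunkLogger and collect_chunks and the
sample text are made up for illustration; they are not part of
xml.sax.  It records every characters call separately.  How many
chunks arrive depends on the parser, so only the joined text
should be relied upon:

```python
import xml.sax
import xml.sax.handler

class ChunkLogger(xml.sax.handler.ContentHandler):
    """Hypothetical helper: keeps each characters() chunk as delivered."""
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        # One call per chunk; the parser decides where the chunks break.
        self.chunks.append(content)

def collect_chunks(xml_text):
    # Hypothetical convenience wrapper around xml.sax.parseString.
    handler = ChunkLogger()
    xml.sax.parseString(xml_text, handler)
    return handler.chunks

chunks = collect_chunks(b"<plek>fish &amp; chips</plek>")
print("%d chunk(s): %r" % (len(chunks), chunks))
print("joined: %r" % "".join(chunks))
```

If the first line reports more than one chunk, that is exactly
the case the accumulate-then-join approach is guarding against.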

endDocument is called but once, when the document
is finished, and here we just print out each piece in
the list we collected.
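
Printing from endDocument is only one option.  Since the
collected pieces stay on the handler instance, the caller can
just as well read them after parseString returns.  The sketch
below is a trimmed-down variant of the handler above; the names
PieceCollector, depth, current and extract are mine, chosen for
illustration.  It also shows why the count matters: a trigger
element nested inside another one still yields a single piece.

```python
import xml.sax
import xml.sax.handler

class PieceCollector(xml.sax.handler.ContentHandler):
    """Hypothetical variant: collects pieces but prints nothing."""
    def __init__(self, trigger):
        xml.sax.handler.ContentHandler.__init__(self)
        self.trigger = trigger
        self.pieces = []
        self.depth = 0      # plays the role of inTrigger
        self.current = []   # plays the role of currpiece

    def startElement(self, name, attrs):
        if name == self.trigger:
            self.depth += 1

    def characters(self, content):
        if self.depth:
            self.current.append(content)

    def endElement(self, name):
        if name == self.trigger:
            self.depth -= 1
            if not self.depth:
                # Outermost trigger element closed: the piece is complete.
                self.pieces.append("".join(self.current))
                self.current = []

def extract(xml_text, trigger):
    # Hypothetical helper: parse and hand back the collected pieces.
    handler = PieceCollector(trigger)
    xml.sax.parseString(xml_text, handler)
    return handler.pieces

nested = b"<foo><plek>a<plek>b</plek>c</plek><plek>d</plek></foo>"
print(extract(nested, "plek"))
```

With a plain true/false flag instead of a count, the inner
closing tag would end the piece too early.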

Here is the test code with which such a module would normally
end:


def test():
    examp = """<?xml version="1.0"?>
    <foo bar="baz">
        <qux a="b"/>
        <plek>Hello</plek>
        there
        <plek>World</plek>
    </foo>
    """
    sax.parseString(examp, handle('plek'))

if __name__=='__main__':
    test()


Of course, there is much more, but these are the most
basic of basics...
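
Since the question was about parsing a file rather than a
string: xml.sax.parse takes a filename (or an open file object)
in place of the string.  A minimal sketch, assuming a throwaway
temporary file and a made-up handler named NameLister that just
remembers element names in document order:

```python
import os
import tempfile
import xml.sax
import xml.sax.handler

class NameLister(xml.sax.handler.ContentHandler):
    """Hypothetical handler: remembers element names in document order."""
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)

def list_names(path):
    # Hypothetical helper: parse the file at `path`, return element names.
    handler = NameLister()
    xml.sax.parse(path, handler)
    return handler.names

# Write a small sample file so the sketch runs on its own.
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write(b'<?xml version="1.0"?>'
            b'<foo bar="baz"><qux a="b"/><plek>Hello</plek></foo>')
try:
    names = list_names(path)
finally:
    os.remove(path)
print(names)
```

For a real file you would drop the tempfile part and pass your
own path to list_names.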


Alex





