[Tutor] general explanation of xml.sax and its usage

Paul Tremblay phthenry@earthlink.net
Fri May 16 18:15:03 2003


On Fri, May 16, 2003 at 02:54:57PM -0500, Isaac Hall wrote:
> Hi Folks,
> 
> A while back I posted on here asking for ways to parse XML into a python 
> dictionary.  Several people suggested xml.sax, but since the documentation 
> seemed to be a bit more complex than I really needed (I only needed to get 
> very simple XML) I abandoned that, and instead wrote my own little script 
> that just sorted xml into a dictionary by just reading the text.  That 
> worked well for its purpose, and continues to do so.  However, now for 
> another utility, I need to be able to parse much more complicated XML into 
> some kind of python object.  I am now looking at the docs for xml.sax, and 
> still,  the docs make it appear more complicated than it probably is.  Can 
> anyone point me to some sample code using xml.sax which is commented, so 
> that I may see it in use.  Or just a smiplified explanation of how to use 
> the objects in xml.sax would be very helpful.  
> 
> Thanks in advance
> 

Hi Ike

I am writing a lot of scripts that use SAX. 

SAX is very simple to use. The basic idea is that each time the start
of an element is found, SAX gives you a name, and a dictionary of
attributes.

Each time character data is found, it simply gives you the character
data.


And each time the end of an element is found, SAX gives you the name

That is essentially all SAX does. However, it is powerful because it
separates content from XML  markup. You also don't have to worry about
the form of the XML. SAX sees 

<tag attr ="value"  > 

and 

<tag 
attr ='value'> 


as identical.

Here is an example script that simply copies an XML file.

Paul

#!/usr/bin/python


import xml.sax.saxutils
import xml.sax

# turn on this line if you want to disablenamespaces
##from xml.sax.handler import feature_namespaces



import codecs




class CopyTree(xml.sax.saxutils.DefaultHandler):
    """

    A simple class that simply copies a tree

    """
    
    def __init__(self, write_obj):
        self.__character = ""
        self.__write_obj = write_obj


    def startElement(self, name, attrs):
        """

        Logic:

            The SAX driver uses this function when if finds a beginning tag.




        """
        self.__write_character()
        self.__name = name
        open_tag = '<%s' % name
        keys = attrs.keys()
        for key in keys:
            att = key
            value = attrs[key]
            open_tag += ' %s="%s"' % (att, value)
        open_tag += '>'
        self.__write_obj.write(open_tag)



    def __write_character(self):
        """

        Logic:

            simply writet he block of character data.

            Note:  you could do whatever you wanted with the data.
            
            For example, you could say:

		if self.__name == 'February':
                    return_data = self.__handle_Feb_data()

           This would allow you to look at data just between certain
tags.

        """

        self.__write_obj.write(self.__character)
        self.__character = ''
    
    def characters(self, character):
        """

        Logic:

            The SAX driver uses this method whenever it finds character data.
            You do not know how much character data it will dump at a time. It
            could be one character, on a thousand.

            """
        self.__character += character
                

    def endElement(self, name):
        """

        Logic:

            The SAX driver uses this method whenever it finds the end of an element.

        """
        self.__write_character()
        self.__write_obj.write('</%s>' % name)


if __name__ == '__main__':
    (utf8_encode, utf8_decode, utf8_reader, utf8_writer) = codecs.lookup("utf-8")
    # create a write object
    output = '/home/paul/paultemp/test_output.xml'
    write_obj = utf8_writer(open(output, 'w'))

    # set up a parser object
    parser_obj = xml.sax.make_parser()

    # turn on this line if you want to disable namespaces
    ##parser.setFeature(feature_namespaces, 0)


    # create a content_handler object
    content_handler = CopyTree(write_obj = write_obj)

    # pass the object you just created to SAX
    parser_obj.setContentHandler(content_handler)

    # use SAX to parse the file
    file = '/home/paul/paultemp/test.xml'
    parser_obj.parse(file)             
    write_obj.close()


-- 

************************
*Paul Tremblay         *
*phthenry@earthlink.net*
************************