[Tutor] XML Parser

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Wed, 8 Aug 2001 00:59:58 -0700 (PDT)


On Wed, 8 Aug 2001, Praveen Pathiyil wrote:

>         I am trying out some basic stuff on XML with Python. But I am
> unable to get a parser instance on my system ( I have cut and pasted
> some code which i tried out for this.). Also the "drivers" modules
> under xml.sax and "pyexpat" is not installed. (Since i am working on
> the server of my company, I can't install these packages myself).
>         How do I proceed to get a parser instance ?
> 


Hello!  Let's take a look!

[warning: this message is a little long]



The XMLReader class (defined in xml.sax.xmlreader) is not meant to be used
directly: it is "documentation", that is, it's an interface that says that
anything that parses XML should provide the methods listed in:

    http://www.python.org/doc/current/lib/module-xml.sax.xmlreader.html


To actually get at a real XML parser, try xml.sax.make_parser() instead:

###
>>> parser = xml.sax.make_parser()
###

Python will search our system for an appropriate XML parser; on many
systems, Python will use xml.parsers.expat.  Try this first, and see
whether an error message pops up.  If an error occurs, then you'll
probably need to install Expat from:

http://sourceforge.net/project/showfiles.php?group_id=10127&release_id=45670


If you do get that exception, email us, and we can try fixing it.  
Anyway, let's assume that you didn't run into an error.  *grin*
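
If you'd like to check for that error in code rather than at the prompt,
here is a small sketch.  I'm assuming Python 3 here (the sessions in this
message come from an older Python), and the helper name
get_parser_or_none is made up for illustration; make_parser() and
SAXReaderNotAvailable both come from xml.sax.

```python
import xml.sax


def get_parser_or_none():
    """Return a SAX parser, or None if no parser driver is installed.

    Hypothetical helper, not part of xml.sax.
    """
    try:
        return xml.sax.make_parser()
    except xml.sax.SAXReaderNotAvailable:
        # make_parser() raises this when it cannot find any usable driver.
        return None


parser = get_parser_or_none()
if parser is None:
    print("no SAX parser available on this system")
else:
    print("got a parser:", type(parser).__name__)
```

If this prints the "no SAX parser" message, then the missing pyexpat
module is the likely culprit.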


Once we have this real parser, then we can start calling XMLReader'ish
methods on it, and be fairly sure that these methods will work:

###
>>> parser.parse
<method ExpatParser.parse of ExpatParser instance at 0x80d276c>
>>> parser.setContentHandler
<method XMLReader.setContentHandler of ExpatParser instance at
0x80d276c>
###
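
The relationship between the real parser and the XMLReader interface can
also be checked directly.  This is only a sketch, again assuming Python 3
and a system where make_parser() succeeds; the variable names are mine.

```python
import xml.sax
from xml.sax.xmlreader import XMLReader

parser = xml.sax.make_parser()

# The concrete parser should be an instance of the XMLReader interface.
is_reader = isinstance(parser, XMLReader)

# Collect the set*-style methods that the parser exposes.
setters = sorted(name for name in dir(parser) if name.startswith("set"))

print(is_reader)
print(setters)
```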



Now we need to tell the parser what sort of content we're interested
in.  If you look at the documentation on xmlreader, you'll see a bunch of
setFoobar()-style functions; the one we need is setContentHandler(),
which registers the object that will receive the parsing events.  This
will be easier if we have some sort of concrete example to work
with.


If we have some sort of document, like:

###
mydoc = """
<html>
<body><h1>Philip and Alex's Guide to Web Publishing</h1>
      <p>This is a really neat book.</p>
      <h1>The Practical SQL Handbook</h1>
      <p>This is also a neat book.</p>
      <h1>The Lord of the Rings</h1>
      <p>Frodo Lives!</p>
</body></html>"""
###

it might be nice to write something that pulls out all the <h1> headers
in a document.  What we'll do will look very familiar to anyone who has
played with sgmllib:  we'll be writing a subclass that looks at
specific things, and that means writing a ContentHandler:

    (http://www.python.org/doc/current/lib/content-handler-objects.html)

Here's one example of such a handler:


###
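class HeadlineHandler(ContentHandler):
    """We're just interested in <h1> headlines..."""
    def __init__(self):
        ContentHandler.__init__(self)   # let the base class set itself up
        self.in_h1 = 0                  # 1 while we're inside an <h1>


    def startElement(self, name, attrs):
        ## called for every opening tag
        if name == 'h1': self.in_h1 = 1


    def endElement(self, name):
        ## called for every closing tag
        if name == 'h1': self.in_h1 = 0


    def characters(self, content):
        ## called for text between tags; only print it inside an <h1>
        if self.in_h1:
            print content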
###
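
To see how the parser drives a handler like this, here is a rough sketch
that records every callback instead of printing.  It assumes Python 3;
TraceHandler and the tiny test document are invented for illustration and
are not part of xml.sax.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TraceHandler(ContentHandler):
    """Record each SAX callback as a (kind, value) tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Skip whitespace-only text so the trace stays readable.
        if content.strip():
            self.events.append(("text", content))


tracer = TraceHandler()
# parseString accepts bytes; this document is a made-up example.
xml.sax.parseString(b"<body><h1>Title</h1><p>Text</p></body>", tracer)

for event in tracer.events:
    print(event)
```

The start and end events come out in document order, with the text
events in between, which is why a simple flag like in_h1 is enough to
know where we are.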




Let's put all the pieces together, and see if this contraption can fly:

###
import xml.sax
from xml.sax.handler import ContentHandler
from StringIO import StringIO

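class HeadlineHandler(ContentHandler):
    """We're just interested in <h1> headlines..."""
    def __init__(self):
        ContentHandler.__init__(self)   # let the base class set itself up
        self.in_h1 = 0                  # 1 while we're inside an <h1>


    def startElement(self, name, attrs):
        ## called for every opening tag
        if name == 'h1': self.in_h1 = 1


    def endElement(self, name):
        ## called for every closing tag
        if name == 'h1': self.in_h1 = 0


    def characters(self, content):
        ## called for text between tags; only print it inside an <h1>
        if self.in_h1:
            print content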


if __name__ == '__main__':
    mydoc = """
    <html>
    <body><h1>Philip and Alex's Guide to Web Publishing</h1>
          <p>This is a really neat book.</p>
          <h1>The Practical SQL Handbook</h1>
          <p>This is also a neat book.</p>
          <h1>The Lord of the Rings</h1>
          <p>Frodo Lives!</p>
    </body></html>"""
    parser = xml.sax.make_parser()
    handler = HeadlineHandler()
    parser.setContentHandler(handler)
    parser.parse(StringIO(mydoc))

    ## a little silliness since parser.parse needs a "file".  Much
    ## easier would have been to say, more directly:
    ##
    ##    handler = HeadlineHandler() 
    ##    xml.sax.parseString(mydoc, handler)
    ##
###


###
dyoo@coffeetable:~$ python xmltest.py
Philip and Alex's Guide to Web Publishing
The Practical SQL Handbook
The Lord of the Rings
###
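

One caveat about the handler above: a SAX parser is allowed to deliver
the text of a single element in several characters() calls, so printing
inside characters() can split a headline across lines.  Here is a sketch
of a variant that gathers the pieces and keeps the headlines in a list.
It assumes Python 3 and uses xml.sax.parseString; the class name
HeadlineCollector and its attributes are my own invention.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class HeadlineCollector(ContentHandler):
    """Collect the text of every <h1> element into self.headlines."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._chunks = None  # list of text pieces while inside <h1>, else None

    def startElement(self, name, attrs):
        if name == "h1":
            self._chunks = []

    def characters(self, content):
        # May be called more than once per element, so just accumulate.
        if self._chunks is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "h1" and self._chunks is not None:
            self.headlines.append("".join(self._chunks))
            self._chunks = None


MYDOC = """<html>
<body><h1>Philip and Alex's Guide to Web Publishing</h1>
      <p>This is a really neat book.</p>
      <h1>The Practical SQL Handbook</h1>
      <p>This is also a neat book.</p>
      <h1>The Lord of the Rings</h1>
      <p>Frodo Lives!</p>
</body></html>"""

collector = HeadlineCollector()
# Encode to bytes, which parseString accepts.
xml.sax.parseString(MYDOC.encode("utf-8"), collector)

for headline in collector.headlines:
    print(headline)
```

Keeping the results in a list instead of printing them also makes the
handler easier to reuse from other code.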


I have to stop now, or I won't be able to wake up tomorrow.  *grin* I know
this is insufficient, but I hope it's somewhat helpful.  Good luck!