[Tutor] XML Parser

Praveen Pathiyil ppathiyi@cisco.com
Thu, 9 Aug 2001 10:17:20 +0530


Thanks Danny ...

After going through your mail, I understood how to use these functions.

But can you explain the concept behind the interface functions and how
Python achieves this by having classes for these functions? I guess they
are implemented as abstract base classes, and we are supposed to write
derived classes of our own that override the functions with our own
implementations. But a formal explanation would be a great thing.

Thanks,
Praveen.

----- Original Message -----
From: "Danny Yoo" <dyoo@hkn.eecs.berkeley.edu>
To: "Praveen Pathiyil" <ppathiyi@cisco.com>
Cc: <tutor@python.org>; <xml-sig@python.org>
Sent: Wednesday, August 08, 2001 1:29 PM
Subject: Re: [Tutor] XML Parser


> On Wed, 8 Aug 2001, Praveen Pathiyil wrote:
>
> >         I am trying out some basic stuff on XML with Python. But I am
> > unable to get a parser instance on my system (I have cut and pasted
> > some code which I tried out for this). Also, the "drivers" modules
> > under xml.sax and "pyexpat" are not installed. (Since I am working on
> > the server of my company, I can't install these packages myself.)
> >         How do I proceed to get a parser instance?
> >
>
>
> Hello!  Let's take a look!
>
> [warning: this message is a little long]
>
>
>
> xml.sax.XMLReader is not meant to be used directly: it is "documentation",
> that is, it's an interface that says that anything that parses XML should
> provide the sort of methods listed in:
>
>     http://www.python.org/doc/current/lib/module-xml.sax.xmlreader.html
>
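
A small sketch of what "interface" means here, assuming a current Python 3
standard library (the archived mail predates it): the XMLReader base class
can be created, but its parse() is only a stub, while the ContentHandler
base class supplies callbacks that do nothing until a subclass overrides
them. The names parse_is_stub and default_result are made up for this
illustration.

```python
from xml.sax.xmlreader import XMLReader
from xml.sax.handler import ContentHandler

# The base XMLReader only describes the interface; its parse() is a stub.
reader = XMLReader()
try:
    reader.parse("anything.xml")      # hypothetical file name, never opened
except NotImplementedError:
    parse_is_stub = True
else:
    parse_is_stub = False

# The base ContentHandler has every callback, but each one does nothing.
# A subclass overrides only the callbacks it cares about.
default_result = ContentHandler().startElement("h1", {})

print(parse_is_stub, default_result)
```

A real parser, such as the one returned by xml.sax.make_parser(), is a
subclass that fills in parse(); a handler written by the user is a subclass
that fills in callbacks such as startElement().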
>
> To actually get at a real XML parser, try xml.sax.make_parser() instead:
>
> ###
> >>> import xml.sax
> >>> parser = xml.sax.make_parser()
> ###
>
> Python will search our system for an appropriate XML parser; on many
> systems, Python will use xml.parsers.expat.  Try this first, and see if an
> error message pops up or not.  If an error occurs, then you'll probably
> need to install Expat from:
>
>
>     http://sourceforge.net/project/showfiles.php?group_id=10127&release_id=45670
>
>
> If you do get that exception, email us, and we can try fixing it.
> Anyway, let's assume that you didn't run into an error.  *grin*
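
One way to check for that error in a script rather than at the prompt,
sketched for a current Python 3: xml.sax raises SAXReaderNotAvailable when
it cannot find any parser driver. Setting parser to None in that case is
just a choice made for this example.

```python
import xml.sax
from xml.sax import SAXReaderNotAvailable
from xml.sax.xmlreader import XMLReader

try:
    parser = xml.sax.make_parser()
except SAXReaderNotAvailable as exc:
    # No usable driver (for example, pyexpat is missing on this system).
    parser = None
    print("no SAX parser available:", exc)
else:
    # Whatever driver was found, it follows the XMLReader interface.
    print("got a parser:", parser.__class__.__name__)
    print("is an XMLReader:", isinstance(parser, XMLReader))
```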
>
>
> Once we have this real parser, then we can start calling XMLReader'ish
> methods on it, and be fairly sure that these methods will work:
>
> ###
> >>> parser.parse
> <method ExpatParser.parse of ExpatParser instance at 0x80d276c>
> >>> parser.setContentHandler
> <method XMLReader.setContentHandler of ExpatParser instance at 0x80d276c>
> ###
>
>
>
> Now we need to tell the parser what sort of content we're interested
> in.  If you look at the documentation on xmlreader, you'll see a bunch of
> setFoobar()-style functions: we'll need to do something with that.  This
> will be easier if we have some sort of concrete example to work
> with.
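
As a rough illustration of the set...() functions, assuming a current
Python 3: each one registers an object that the parser will call back, and
a matching get...() function returns whatever is currently registered. The
plain base-class handlers below are placeholders; a real program would
register its own subclasses.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler

parser = xml.sax.make_parser()

content_handler = ContentHandler()   # base class: every callback is a no-op
error_handler = ErrorHandler()       # base class: error() and fatalError() raise

# Register the objects the parser should call back while it reads.
parser.setContentHandler(content_handler)
parser.setErrorHandler(error_handler)

# The getters hand back exactly the objects that were registered.
print(parser.getContentHandler() is content_handler)
print(parser.getErrorHandler() is error_handler)
```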
>
>
> If we have some sort of document, like:
>
> ###
> mydoc = """
> <html>
> <body><h1>Philip and Alex's Guide to Web Publishing</h1>
>       <p>This is a really neat book.</p>
>       <h1>The Practical SQL Handbook</h1>
>       <p>This is also a neat book.</p>
>       <h1>The Lord of the Rings</h1>
>       <p>Frodo Lives!</p>
> </body></html>"""
> ###
>
> it might be nice to write something that picks out all the <h1> headers
> in a document.  What we'll do will look very familiar to anyone who has
> played with sgmllib:  we'll be writing a subclass that looks at
> specific things, and that means writing a ContentHandler:
>
>     (http://www.python.org/doc/current/lib/content-handler-objects.html)
>
> Here's one example of such a handler:
>
>
> ###
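> from xml.sax.handler import ContentHandler
>
> class HeadlineHandler(ContentHandler):
>     """We're just interested in <h1> headlines..."""
>     def __init__(self):
>         ContentHandler.__init__(self)   # let the base class set itself up
>         self.in_h1 = 0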
>
>
>     def startElement(self, name, attrs):
>         if name == 'h1': self.in_h1 = 1
>
>
>     def endElement(self, name):
>         if name == 'h1': self.in_h1 = 0
>
>
>     def characters(self, content):
>         if self.in_h1:
>             print content
> ###
>
>
>
>
> Let's put all the pieces together, and see if this contraption can fly:
>
> ###
> import xml.sax
> from xml.sax.handler import ContentHandler
> from StringIO import StringIO
>
> class HeadlineHandler(ContentHandler):
>     """We're just interested in <h1> headlines..."""
>     def __init__(self):
>         ContentHandler.__init__(self)   # let the base class set itself up
>         self.in_h1 = 0
>
>
>     def startElement(self, name, attrs):
>         if name == 'h1': self.in_h1 = 1
>
>
>     def endElement(self, name):
>         if name == 'h1': self.in_h1 = 0
>
>
>     def characters(self, content):
>         if self.in_h1:
>             print content
>
>
> if __name__ == '__main__':
>     mydoc = """
>     <html>
>     <body><h1>Philip and Alex's Guide to Web Publishing</h1>
>           <p>This is a really neat book.</p>
>           <h1>The Practical SQL Handbook</h1>
>           <p>This is also a neat book.</p>
>           <h1>The Lord of the Rings</h1>
>           <p>Frodo Lives!</p>
>     </body></html>"""
>     parser = xml.sax.make_parser()
>     handler = HeadlineHandler()
>     parser.setContentHandler(handler)
>     parser.parse(StringIO(mydoc))
>
>     ## a little silliness since parser.parse needs a "file".  Much
>     ## easier would have been to say, more directly:
>     ##
>     ##    handler = HeadlineHandler()
>     ##    xml.sax.parseString(mydoc, handler)
>     ##
> ###
>
>
> ###
> dyoo@coffeetable:~$ python xmltest.py
> Philip and Alex's Guide to Web Publishing
> The Practical SQL Handbook
> The Lord of the Rings
> ###
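
A variation on the same handler, sketched for a current Python 3 and using
the xml.sax.parseString() shortcut mentioned in the comment at the end of
the script. A SAX parser is allowed to deliver the text of one element in
several characters() calls, so this version collects the pieces and joins
them when the </h1> arrives instead of printing each piece. The class name
HeadlineCollector, its attributes, and the shortened document are made up
for this example.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class HeadlineCollector(ContentHandler):
    """Collect the text of every <h1> element into self.headlines."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.parts = []        # text pieces of the <h1> being read
        self.headlines = []    # finished headlines, in document order

    def startElement(self, name, attrs):
        if name == "h1":
            self.in_h1 = True
            self.parts = []

    def characters(self, content):
        # May be called more than once for a single run of text.
        if self.in_h1:
            self.parts.append(content)

    def endElement(self, name):
        if name == "h1":
            self.in_h1 = False
            self.headlines.append("".join(self.parts))


doc = (b"<html><body><h1>The Lord of the Rings</h1>"
       b"<p>Frodo Lives!</p><h1>A &amp; B</h1></body></html>")

handler = HeadlineCollector()
xml.sax.parseString(doc, handler)   # no file-like wrapper needed
print(handler.headlines)
```

Keeping the results in a list rather than printing inside characters() also
makes the handler easier to reuse from other code.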
>
>
> I have to stop now, or I won't be able to wake up tomorrow.  *grin* I know
> this is insufficient, but I hope it's somewhat helpful.  Good luck!
>
>