Trouble using XML Reader

Mike D 42flicks at gmail.com
Tue Mar 4 10:14:38 CET 2008


On 3/3/08, Mike D <42flicks at gmail.com> wrote:
>
> Hello,
>
> I'm using XML Reader (xml.sax.xmlreader.XMLReader) to create an rss
> reader.
>
> I can parse the file but am unsure how to extract the elements I require.
> For example: For each <item> element I want the title and description.
>
> I have some stub code; I want to create a list of objects which include a
> title and description.
>
> I have the following code (a bit hacked up):
>
> import sys
> from xml.sax import make_parser
> from xml.sax import handler
>
> class rssObject(object):
>     objectList=[]
>     def addObject(self,object):
>         rssObject.objectList.append(object)
>
> class rssObjectDetail(object):
>     title = ""
>     content = ""
>
>
> class SimpleHandler(handler.ContentHandler):
>     def startElement(self,name,attrs):
>         print name
>
>     def endElement(self,name):
>         print name
>
>     def characters(self,data):
>         print data
>
>
> class SimpleDTDHandler(handler.DTDHandler):
>     def notationDecl(self,name,publicid,systemid):
>         print "Notation: " , name, publicid, systemid
>
>     def unparsedEntityDecl(self,name,publicid,systemid,ndata):
>         print "UnparsedEntity: " , name, publicid, systemid, ndata
>
> p= make_parser()
> c = SimpleHandler()
> p.setContentHandler(c)
> p.setDTDHandler(SimpleDTDHandler())
> p.parse('topstories.xml')
>
> And am using this xml file:
>
> <?xml version="1.0"?>
> <rss version="2.0">
>   <channel>
>     <title>Stuff.co.nz - Top Stories</title>
>     <link>http://www.stuff.co.nz</link>
>     <description>Top Stories from Stuff.co.nz. New Zealand, world, sport,
> business &amp; entertainment news on Stuff.co.nz. </description>
>     <language>en-nz</language>
>     <copyright>Fairfax New Zealand Ltd.</copyright>
>     <ttl>30</ttl>
>     <image>
>       <url>/static/images/logo.gif</url>
>       <title>Stuff News</title>
>       <link>http://www.stuff.co.nz</link>
>     </image>
>
> <item id="4423924" count="1">
> <title>Prince Harry 'wants to live in Africa'</title>
> <link>http://www.stuff.co.nz/4423924a10.html?source=RSStopstories_20080303
> </link>
> <description>For Prince Harry it must be the ultimate dark irony: to be in
> such a privileged position and have so much opportunity, and yet be unable
> to fulfil a dream of fighting for the motherland.</description>
> <author>EDMUND TADROS</author>
> <guid isPermaLink="false">stuff.co.nz/4423924</guid>
> <pubDate>Mon, 03 Mar 2008 00:44:00 GMT</pubDate>
> </item>
>
>   </channel>
> </rss>
>
> Is there something I'm missing? I can't figure out how to correctly
> interpret the document using the SAX parser. I'm sure I'm missing something
> obvious :)
>
> Any tips or advice would be appreciated! Also advice on correctly
> implementing what I want to achieve would be appreciated as using
> objectList=[] in the ContentHandler seems like a hack.
>
> Thanks!
>

My mistake: the example I posted uses the SAX API, but the same file can be
handled with DOM manipulation instead. I'll be able to do it now :)
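
For reference, here is a minimal sketch of what the DOM route might look like,
assuming xml.dom.minidom from the standard library and Python 3 syntax (the
code quoted above is Python 2). The helper names child_text and read_items are
made up for illustration, and SAMPLE is a shortened copy of the feed above, not
the full file.

```python
from xml.dom import minidom

# Shortened copy of the feed quoted above (assumption: the same structure,
# with the description text abbreviated).
SAMPLE = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Stuff.co.nz - Top Stories</title>
    <item id="4423924" count="1">
      <title>Prince Harry 'wants to live in Africa'</title>
      <description>For Prince Harry it must be the ultimate dark irony.</description>
    </item>
  </channel>
</rss>"""


def child_text(parent, tag):
    """Return the text of the first direct child element named tag, or ''."""
    for node in parent.childNodes:
        if node.nodeType == node.ELEMENT_NODE and node.tagName == tag:
            parts = [
                t.data
                for t in node.childNodes
                if t.nodeType in (t.TEXT_NODE, t.CDATA_SECTION_NODE)
            ]
            return "".join(parts).strip()
    return ""


def read_items(xml_text):
    """Return a list of (title, description) tuples, one per <item>."""
    doc = minidom.parseString(xml_text)
    return [
        (child_text(item, "title"), child_text(item, "description"))
        for item in doc.getElementsByTagName("item")
    ]


items = read_items(SAMPLE)
for title, description in items:
    print(title)
    print(description)
```

Looking only at direct children of each <item> keeps the channel-level <title>
and the <image> <title> out of the results.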

Oh, I also posted a hacked-up implementation; I understand my classes look
awful!
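
For anyone who wants to stay with SAX, here is a rough sketch of how the
handler could collect the items itself, so the list lives on the handler
instance rather than in a class-level objectList. It assumes Python 3 and
xml.sax.parseString; the names RssItem and ItemHandler are hypothetical, and
SAMPLE is again a shortened copy of the feed above.

```python
import xml.sax
from xml.sax import handler


class RssItem(object):
    """Plain container for the two fields wanted from each <item>."""

    def __init__(self):
        self.title = ""
        self.description = ""


class ItemHandler(handler.ContentHandler):
    WANTED = ("title", "description")

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.items = []      # finished RssItem objects
        self._item = None    # the <item> currently being read, if any
        self._field = None   # name of the wanted element we are inside
        self._chunks = []    # text pieces collected for that element

    def startElement(self, name, attrs):
        if name == "item":
            self._item = RssItem()
        elif self._item is not None and name in self.WANTED:
            self._field = name
            self._chunks = []

    def characters(self, data):
        # The parser may deliver one element's text in several calls,
        # so the pieces are buffered and joined in endElement.
        if self._field is not None:
            self._chunks.append(data)

    def endElement(self, name):
        if name == "item":
            self.items.append(self._item)
            self._item = None
        elif name == self._field:
            setattr(self._item, name, "".join(self._chunks).strip())
            self._field = None


# Shortened copy of the feed quoted above (assumption: same structure).
SAMPLE = b"""<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Stuff.co.nz - Top Stories</title>
    <item id="4423924" count="1">
      <title>Prince Harry 'wants to live in Africa'</title>
      <description>For Prince Harry it must be the ultimate dark irony.</description>
    </item>
  </channel>
</rss>"""

rss_handler = ItemHandler()
xml.sax.parseString(SAMPLE, rss_handler)
for item in rss_handler.items:
    print(item.title)
    print(item.description)
```

The channel-level <title> is skipped because no <item> is open when it is
seen; only elements inside an <item> are recorded.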