Parsing XML RSS feed byte stream for <item> tag
John Gordon
gordon at panix.com
Thu Feb 7 16:00:38 EST 2013
In <16828a11-6c7c-4ab6-b406-6b8819883b5e at googlegroups.com> darrel.rendell at gmail.com writes:
> def pageReader(url):
>     try:
>         readPage = urllib2.urlopen(url)
>     except urllib2.URLError, e:
>         # print 'We failed to reach a server.'
>         # print 'Reason: ', e.reason
>         return 404
>     except urllib2.HTTPError, e:
>         # print('The server couldn\'t fulfill the request.')
>         # print('Error code: ', e.code)
>         return 404
>     else:
>         outputPage = readPage.read()
>         return outputPage
> To recreate my error, simply call the above function with an argument
> similar to:
> http://www.cert.org/nav/cert_announcements.rss
> You'll see I'm trying to return the first child.
The above code produces no output at all. The pageReader() function is
defined but never called.
If we add a few lines at the bottom:
if __name__ == '__main__':
    print pageReader('http://www.cert.org/nav/cert_announcements.rss')
Then we get some output:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>CERT Announcements</title>
<link>http://www.cert.org/nav/whatsnew.html</link>
<language>en-us</language>
<description>Announcements: What's New on the CERT web site</description>
<item>
<title>New Blog Entry: Common Sense Guide to Mitigating Insider Threats - Best Practice 16 (of 19)</title>
<link>http://www.cert.org/blogs/insider_threat/2013/02/common_sense_guide_to_mitigating_insider_threats_-_best_practice_16_of_19.html</link>
<description>This sixteenth of 19 blog posts about the fourth edition of the Common Sense Guide to Mitigating Insider Threats describes Practice 16: Develop a formalized insider threat program.</description>
<pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate>
</item>
...
> As I've said, BeautifulSoup fails to find both pubDate and Link, which are
> crucial to my app.
> Any advice would be greatly appreciated.
You haven't included the BeautifulSoup code which attempts to parse the XML,
so it's impossible to say exactly what the error is.
However, I have a guess: you said you're trying to return the first
child. Based on the above output, the first child is the <channel>
element, not an <item> element. Perhaps that's the issue?
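To illustrate the guess above, here is a minimal sketch of how the first child differs from the first <item>. It is not your BeautifulSoup code (which I haven't seen); it assumes the standard library's xml.etree.ElementTree instead, is written for Python 3, and runs against a trimmed copy of the output shown earlier rather than the live feed. The names SAMPLE_FEED and first_item_fields are made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the feed output shown above (one <item>, description omitted).
SAMPLE_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>CERT Announcements</title>
<link>http://www.cert.org/nav/whatsnew.html</link>
<language>en-us</language>
<description>Announcements: What's New on the CERT web site</description>
<item>
<title>New Blog Entry: Common Sense Guide to Mitigating Insider Threats - Best Practice 16 (of 19)</title>
<link>http://www.cert.org/blogs/insider_threat/2013/02/common_sense_guide_to_mitigating_insider_threats_-_best_practice_16_of_19.html</link>
<pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate>
</item>
</channel>
</rss>
"""


def first_item_fields(xml_bytes):
    """Return (tag of the root's first child, link, pubDate of the first <item>)."""
    root = ET.fromstring(xml_bytes)    # the root element is <rss>
    first_child = root[0]              # this is <channel>, not <item>
    item = root.find("channel/item")   # first <item> nested under <channel>
    if item is None:
        return first_child.tag, None, None
    return first_child.tag, item.findtext("link"), item.findtext("pubDate")


# Encode to bytes to mimic what readPage.read() returns.
first_tag, link, pub_date = first_item_fields(SAMPLE_FEED.encode("utf-8"))
print(first_tag)
print(link)
print(pub_date)
```

The point is only that <item> sits one level below <channel>, so a lookup that stops at the first child never reaches <link> or <pubDate> of an item.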
--
John Gordon                   A is for Amy, who fell down the stairs
gordon at panix.com           B is for Basil, assaulted by bears
                                -- Edward Gorey, "The Gashlycrumb Tinies"