[Tutor] HTMLParser problem unable to find all the IMG tags....
Chris Barnhart
mlist-python at dideas.com
Thu Oct 28 14:34:37 CEST 2004
I'm trying to write a program that will locate the front page image at
CNN.com. [If this already exists, I want to do it anyway, as it's a good
learning exercise.]
The problem is that with HTMLParser I'm not getting all the IMG
tags. I know this because I have another program, using plain string
processing, that finds 2.5 times as many IMG SRC tags. I also know it because
HTMLParser's handle_starttag is never called with the IMG that I'm after!
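To make the comparison concrete, here is a minimal sketch of the two counting approaches side by side. It is not the other program mentioned above. It assumes Python 3, where the module is named html.parser, and the names SAMPLE_HTML, count_by_string_search, ImgCounter and count_by_parser are hypothetical. It illustrates one possible reason the totals can differ: a string search also counts IMG text inside comments and scripts, which the parser does not report as start tags.

```python
import re
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module

# Hypothetical markup for illustration; not taken from CNN.com.
SAMPLE_HTML = """<html><body>
<IMG SRC="a.gif">
<!-- <img src="commented-out.gif"> -->
<script>document.write('<img src="from-script.gif">');</script>
<p><img alt="x" src="b.jpg"></p>
</body></html>
"""

# "<img" followed by whitespace, "/" or ">", matched case-insensitively.
IMG_TAG = re.compile(r"<img(?=[\s/>])", re.IGNORECASE)


def count_by_string_search(html):
    """Count IMG tags by pattern matching on the raw text."""
    return len(IMG_TAG.findall(html))


class ImgCounter(HTMLParser):
    """Count IMG start tags reported by the parser."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The parser passes tag names in lowercase.
        if tag == "img":
            self.count += 1


def count_by_parser(html):
    parser = ImgCounter()
    parser.feed(html)
    parser.close()
    return parser.count


print("string search:", count_by_string_search(SAMPLE_HTML))
print("parser:", count_by_parser(SAMPLE_HTML))
```

On this sample the string search also picks up the IMG text in the comment and in the script, so it reports a larger number than the parser. Whether that accounts for the difference on the real page is an open question.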
There is also an exception related to the close method and EOF.
Possibly my problem is in how I feed the data? Or related to nested tags?
Any ideas?
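On the question of how the data is fed: feed() buffers incomplete markup until more text arrives, so feeding line by line should give the same result as feeding the whole document at once. The sketch below checks that on a small sample. It assumes Python 3 (html.parser), and ImgSrcCollector, collect and SAMPLE_HTML are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class ImgSrcCollector(HTMLParser):
    """Collect the SRC attribute of every IMG start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs with lowercase names.
            self.sources.append(dict(attrs).get("src"))


def collect(chunks):
    """Feed each chunk in turn and return the collected SRC values."""
    parser = ImgSrcCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.sources


# Hypothetical markup with one IMG tag split across several lines.
SAMPLE_HTML = (
    "<html><body>\n"
    "<img\n"
    '  src="a.gif"\n'
    '  alt="split tag">\n'
    '<p>text</p><IMG SRC="b.jpg">\n'
    "</body></html>\n"
)

whole = collect([SAMPLE_HTML])
by_line = collect(SAMPLE_HTML.splitlines(keepends=True))
print(whole == by_line, whole)
```

If the two lists match on real input as well, the line-by-line feeding is probably not the cause, and the page's markup itself becomes the more likely suspect.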
Thanks,
Chris
import urllib2
from HTMLParser import HTMLParser

class MyParser(HTMLParser):
    """Count the IMG start tags seen while parsing."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.cnt = 0

    def handle_starttag(self, tag, attr):
        # print "Encountered the beginning of a %s tag" % tag
        # HTMLParser passes tag names in lowercase, so compare against "img".
        if tag == "img":
            self.cnt = self.cnt + 1
            print tag,

    def close(self):
        # Report the count, then let the base class finish parsing.
        print
        print "HTMLParse Found : ", self.cnt
        HTMLParser.close(self)

mp = MyParser()
# Feed the page to the parser one line at a time.
for line in urllib2.urlopen('http://www.cnn.com'):
    mp.feed(line)
mp.close()
print "Finished"