[Tutor] HTMLParser problem unable to find all the IMG tags....

Chris Barnhart mlist-python at dideas.com
Thu Oct 28 14:34:37 CEST 2004


I'm trying to write a program that will locate the front page image at 
CNN.com.  [If this already exists, I want to do it anyway, as it's a good 
learning exercise.]

The problem is that with HTMLParser I'm not getting all the IMG 
tags.  I know this because I have another program that uses plain string 
processing and finds 2.5 times as many IMG SRC tags.  I also know this because 
HTMLParser's handle_starttag is never called with the IMG that I'm after!
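
The string-processing program is not shown in this post. A rough sketch of that kind of count, assuming Python 3, a simple regular expression, and a made-up HTML sample (the names SAMPLE_HTML, IMG_SRC and find_img_sources are hypothetical), might look like this:

```python
import re

# Hypothetical sample markup; the real CNN page is not reproduced here.
SAMPLE_HTML = """
<html><body>
<img src="a.gif">
<IMG SRC="b.jpg" alt="front page">
<p>text <img alt="x" src='c.png'></p>
</body></html>
"""

# Match an opening IMG tag in any letter case and capture its SRC value.
# This is an illustrative pattern, not a full HTML tokenizer.
IMG_SRC = re.compile(
    r'<img\b[^>]*?\bsrc\s*=\s*["\']?([^"\'\s>]+)',
    re.IGNORECASE,
)


def find_img_sources(html):
    """Return the SRC values of IMG tags found by plain pattern matching."""
    return IMG_SRC.findall(html)


if __name__ == "__main__":
    sources = find_img_sources(SAMPLE_HTML)
    print("String search found:", len(sources))
```

A pattern like this does not care whether the surrounding markup is well formed, which may be one reason a string search can report more IMG tags than a parser does.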

There is also an exception related to the close method and EOF.
Possibly my problem is in how I feed the data? Or is it related to nested tags?
Any ideas?

Thanks,
Chris

import urllib2

from HTMLParser import HTMLParser

class MyParser(HTMLParser):
     def __init__( self ) :
         HTMLParser.__init__(self)
         self.cnt = 0

     def handle_starttag(self, tag, attr):
         # print "Encountered the beginning of a %s tag" % tag
         # Tag names arrive in lower case; use == rather than a substring test.
         if tag == "img":
             self.cnt = self.cnt + 1
             print tag,

     def close(self) :
         print
         print "HTMLParse Found : ", self.cnt
         HTMLParser.close(self)



mp = MyParser()
for line in urllib2.urlopen('http://www.cnn.com') :
     mp.feed(line)
mp.close()

print "Finished"
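
One way to test the feeding question raised above is to run the same counter over a fixed string, once in a single feed() call and once line by line, and compare the totals. The sketch below assumes Python 3, where the module is named html.parser. The sample markup and the names ImgCounter, count_imgs and SAMPLE_HTML are made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 name for the HTMLParser module


class ImgCounter(HTMLParser):
    """Count IMG start tags; the parser reports tag names in lower case."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img /> also arrive here by default.
        if tag == "img":
            self.count += 1


def count_imgs(chunks):
    """Feed each chunk to a fresh parser and return the number of IMG tags."""
    parser = ImgCounter()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.count


# Hypothetical sample: a nested IMG, a tag split across lines, a self-closing IMG.
SAMPLE_HTML = (
    '<html><body>\n'
    '<table><tr><td><img src="a.gif"></td></tr></table>\n'
    '<IMG SRC="b.jpg"\n'
    '     alt="split across lines">\n'
    '<a href="/"><img src="c.png" /></a>\n'
    '</body></html>\n'
)

whole = count_imgs([SAMPLE_HTML])
by_line = count_imgs(SAMPLE_HTML.splitlines(keepends=True))
print("one feed:", whole, " line by line:", by_line)
```

If the two totals agree on a sample like this, line-by-line feeding and nesting are unlikely to be the cause by themselves. The sample is small and well formed, so it says nothing about how the parser copes with the markup of a real page.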


