HTML Parser - beginner needs help

Alex Martelli aleaxit at yahoo.com
Thu Sep 14 23:51:56 CEST 2000


"zet" <zet at i.com.ua> wrote in message
news:968956212.35650 at ipt2.iptelecom.net.ua...
> Can somebody provide small piece of code, which returns list of  img tags?
> I've trying this lines:
>
> class IMGParser(HTMLParser):
>  def end_img(arg):
>   return
>
> but it return only an anchors, how to get IMG's?

The general idea:

import sgmllib

class Imgs(sgmllib.SGMLParser):
    def do_img(self, attributes):
        print attributes

getim=Imgs()
getim.feed(open("c:/mydocu~1/samba98.htm").read())
getim.close()

giving output such as:

[('height', '51'), ('src', 'Samba98_files/cllogo_medium.gif'), ('width', '220')]
[('height', '28'), ('src', 'Samba98_files/button_home.gif'), ('width', '28')]
[('height', '28'), ('src', 'Samba98_files/button_up.gif'), ('width', '28')]
[('height', '28'), ('src', 'Samba98_files/button_home.gif'), ('width', '28')]
[('height', '28'), ('src', 'Samba98_files/button_up.gif'), ('width', '28')]


If what you want to do is accumulate a list of the src attributes only,
for example, the class could be:

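class Imgs(sgmllib.SGMLParser):
    def __init__(self):
        # let the base class set up its parsing state first
        sgmllib.SGMLParser.__init__(self)
        self.imgs = []
    def do_img(self, attributes):
        # attributes is a list of (name, value) pairs, not a dictionary
        for name, value in attributes:
            if name == 'src':
                self.imgs.append(value)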

and the end result is left in the .imgs attribute of the object after
.close() is called (of course, you could add an accessor method for
that, if you so desire).
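
For readers on Python 3, where sgmllib is no longer available, the same
idea can be sketched with html.parser.HTMLParser from the standard
library. This is only an illustration of the approach above, not the
original code: the names ImgSources, sources and img_sources and the
sample markup are made up for the example, and handle_starttag stands in
for do_img.

```python
from html.parser import HTMLParser


class ImgSources(HTMLParser):
    """Collect the src attribute of every img tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.imgs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us;
        # attrs is a list of (name, value) pairs, as in the sgmllib version.
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.imgs.append(value)

    def sources(self):
        # Hypothetical accessor: hands back a copy of the accumulated list.
        return list(self.imgs)


def img_sources(html_text):
    # Hypothetical helper: feed the text, close the parser, return the list.
    parser = ImgSources()
    parser.feed(html_text)
    parser.close()
    return parser.sources()


# Made-up sample markup, used only to exercise the parser.
sample = '<p><img src="a.gif" width="28"><a href="x.htm">x</a><IMG SRC="b.gif"></p>'
print(img_sources(sample))
```

The accessor returns a copy so that callers cannot modify the parser's
own list by accident; returning self.imgs directly would work as well.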


Alex






