Parsing an HTML a tag
George Sakkis
gsakkis at
Sat Sep 24 21:06:48 EDT 2005
"George" <buffer_88 at> wrote:
> I'm very new to python and I have tried to read the tutorials but I am
> unable to understand exactly how I must do this problem.
> Specifically, the showIPnums function takes a URL as input, calls the
> read_page(url) function to obtain the entire page for that URL, and
> then lists, in sorted order, the IP addresses implied in the "<A
> HREF=...>" tags within that page.
> """
> Module to print IP addresses of tags in web file containing HTML
> >>> showIPnums('')
> ['', '', '']
> >>> showIPnums('')
> ['', '', '', '',
> '', '', '',
> '', '', '']
> """
> def read_page(url):
>     import formatter
>     import htmllib
>     import urllib
>     htmlp = htmllib.HTMLParser(formatter.NullFormatter())
>     htmlp.feed(urllib.urlopen(url).read())
>     htmlp.close()
> def showIPnums(URL):
>     page=read_page(URL)
> if __name__ == '__main__':
>     import doctest, sys
>     doctest.testmod(sys.modules[__name__])
You forgot to mention that you don't want duplicates in the result. Here's a function that passes
the doctest:
from urllib import urlopen
from urlparse import urlsplit
from socket import gethostbyname
from BeautifulSoup import BeautifulSoup

def showIPnums(url):
    """Return the unique IPs found in the anchors of the webpage at the given
    url.

    >>> showIPnums('')
    ['', '', '']
    >>> showIPnums('')
    ['', '', '', '', '',
    '', '', '', '', '']
    """
    hrefs = set()
    for link in BeautifulSoup(urlopen(url)).fetch('a'):
        # Skip anchors with no href or with a host that does not resolve.
        try: hrefs.add(gethostbyname(urlsplit(link["href"])[1]))
        except: pass
    return sorted(hrefs)
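
For readers on Python 3, where urllib was reorganised and the urlparse and htmllib modules no longer exist, the same idea can be sketched with the standard library alone. This is only an illustration under some assumptions, not the code from the post: the names AnchorCollector, anchor_hosts, fetch_html and show_ip_nums are made up here, the function takes the page text instead of a URL, and the resolve parameter is added so the lookup can be replaced when no network is available.

```python
from html.parser import HTMLParser
from socket import gethostbyname
from urllib.parse import urlsplit
from urllib.request import urlopen


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...>
        # is reported as tag "a" with attribute "href".
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def anchor_hosts(html):
    """Return the set of host names referenced by the anchors in html."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    hosts = set()
    for href in parser.hrefs:
        try:
            host = urlsplit(href).hostname  # None for relative links
        except ValueError:
            continue  # malformed URL, e.g. an unbalanced IPv6 bracket
        if host:
            hosts.add(host)
    return hosts


def fetch_html(url):
    """Download url and decode it (hypothetical helper; falls back to UTF-8)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def show_ip_nums(html, resolve=gethostbyname):
    """Return the sorted, unique IPs for the anchor hosts in html."""
    ips = set()
    for host in anchor_hosts(html):
        try:
            ips.add(resolve(host))
        except OSError:
            pass  # host did not resolve (socket.gaierror is an OSError)
    return sorted(ips)
```

With a live connection this would be called as show_ip_nums(fetch_html(url)). As in the function above, the set removes duplicates and the result is sorted as strings, not numerically by octet.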