Help: HTMLParser cannot parse some web pages?
Paul Lim
paullim at starhub.net.sg
Wed Oct 17 08:21:39 EDT 2001
Hi,
I am new to Python and hope the gurus here can advise me on the
following.
I am trying to extract the links from an HTML file.
My code is shown at the end of this message.
The code works fine on most pages, but I would like to understand the
HTMLParser module better. There are some web pages from which I cannot
extract the links,
and I really don't understand why. An example is
http://www.admissions.rmit.edu.au/about/index.html
Is there some limitation in HTMLParser? For example, are there certain
kinds of web pages from which it cannot extract links? If so, which kinds?
Thank you very much for your help.
Sincerely
Paul
"To extract the links in a page."
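import urllib
from htmllib import HTMLParser
from formatter import NullFormatter

# `link` holds the URL to fetch; it is set elsewhere in the program.
# To open a url and return url handler
try:
    linkHandler = urllib.urlopen(link)
except IOError:
    print "Unable to open url!"
else:
    # Extract links from the HTML file; they are stored in parser.anchorlist
    try:
        parser = HTMLParser(NullFormatter())
        parser.feed(linkHandler.read())
    except:
        # NOTE: a bare except hides the real reason for any failure
        print "Unable to extract!"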
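
One thing that may be worth checking: as far as I can tell, htmllib's
anchorlist is filled only from <a href="..."> tags, so link targets that
live in other tags (for example <frame src="...">) would not show up in
it. Whether that is what happens on the page above is not established
here. The sketch below illustrates the idea under clearly stated
assumptions: it uses html.parser from the Python 3 standard library, not
the htmllib module from my code, and the names LinkCollector,
extract_links and SAMPLE are made up for illustration.

```python
# Sketch only: html.parser is the Python 3 standard-library parser, not the
# htmllib module used in the post. LinkCollector, extract_links and SAMPLE
# are hypothetical names introduced for this example.
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect link targets from <a>, <area>, <frame> and <iframe> tags."""

    # Tag name -> attribute that carries the link target.
    LINK_ATTRS = {"a": "href", "area": "href", "frame": "src", "iframe": "src"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case.
        wanted = self.LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def extract_links(html_text):
    """Return the link targets found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# A made-up page mixing frames and ordinary anchors.
SAMPLE = """
<html><frameset cols="20%,80%">
<frame src="menu.html"><frame src="main.html">
</frameset>
<body><a href="about/index.html">About</a>
<a name="top">no href here</a></body></html>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE))
```

A parser that looked only at <a href> would report a single link for
SAMPLE, while this one also reports the two frame sources. Links that a
page builds with scripts would still be missed by either approach.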