HTMLparsing abnormal html pages

asle at spam.com
Thu Mar 15 03:50:41 EST 2001


Consider the small program below. Running it shows that htmllib's
HTMLParser truncates URLs in the HTML page. Most of you will probably say
that the page, and in particular its URLs, are not valid according to
RFC 1738, so bad luck. But surely there is a work-around for this?

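import formatter
import htmllib
import urllib

# Page whose links come back truncated
url = 'http://di.se/Scripts/Sections/allarticles.asp'

# NullFormatter discards the rendered output; only the anchors matter here
parser = htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(urllib.urlopen(url).read())
parser.close()

# anchorlist holds the hyperlinks collected while parsing
urls = parser.anchorlist
print urls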

One solution is of course to preprocess the whole HTML page, replacing
invalid URLs with valid ones (using regex??). I have also looked into
HTMLParser and the formatter to see what could be done there to correct the
problem, but with no success.
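
A minimal sketch of the preprocessing idea might look like the following. It
rests on assumptions I have not verified against the page: that the offending
URLs sit inside quoted href attributes and that they are invalid because they
contain raw spaces or other characters RFC 1738 wants escaped. The names
HREF_RE, SAFE_CHARS and fix_hrefs are made up for illustration, and the sketch
uses urllib.parse.quote from Python 3 rather than the Python 2 urllib module
used in the program above.

```python
import re
from urllib.parse import quote

# Matches href="..." or href='...'. Assumption: attribute values are quoted.
HREF_RE = re.compile(r'''(href\s*=\s*)(["'])(.*?)\2''', re.IGNORECASE | re.DOTALL)

# Characters left untouched: URL delimiters, plus '%' so that escapes
# already present in the page are not encoded a second time.
SAFE_CHARS = "/:?&=#%;+,@~$!*'()-._"


def fix_hrefs(html):
    """Percent-encode unsafe characters inside quoted href values."""
    def repl(match):
        prefix, quote_char, value = match.groups()
        cleaned = quote(value.strip(), safe=SAFE_CHARS)
        return prefix + quote_char + cleaned + quote_char
    return HREF_RE.sub(repl, html)


if __name__ == "__main__":
    sample = '<a href="/Scripts/my page.asp?id=1">article</a>'
    print(fix_hrefs(sample))
```

The cleaned string would then be passed to parser.feed() in place of the raw
page. If the URLs turn out to be broken in some other way (unquoted attributes,
for instance), the pattern would have to be adapted.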

Any comments on what to do?

/Asle





