HTML parsing of abnormal HTML pages

asle at asle at
Thu Mar 15 09:50:41 CET 2001

Consider the small program below. Running it shows that the parser
truncates URLs in the HTML page. Now, most of you will probably say that
the page, and in particular the URLs on this page, are not valid according
to RFC 1738, so bad luck. But surely there must be a workaround for this?

import htmllib
import urllib
import formatter

print urls
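
For anyone trying this on Python 3, where htmllib no longer exists, a rough
equivalent of a link-listing program can be sketched with html.parser. This
is only an illustration and not the program from this post: the names
LinkCollector and collect_links and the sample markup are made up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.urls.append(value)


def collect_links(html):
    """Return the href values found in the given HTML text."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


if __name__ == "__main__":
    # Made-up sample: a quoted URL that contains a space.
    sample = '<a href="http://example.com/a b.html">x</a>'
    print(collect_links(sample))
```

With a quoted attribute value the URL comes through intact even if it
contains a space. An unquoted value, by the HTML attribute rules, would end
at the first whitespace, which is one way a URL can appear truncated.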

One solution is of course to preprocess the whole HTML page and replace
invalid URLs with valid ones (using a regex?). I have also looked into
HTMLParser and the formatter to see what can be done there to correct the
problem, but with no success on that front.
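
The preprocessing idea could look roughly like the sketch below, written for
Python 3. It assumes the invalid URLs sit in quoted href attributes and that
"invalid" means characters such as spaces that should be percent-encoded; the
post does not say what is actually wrong with the URLs, so both are guesses.
The names HREF_RE, SAFE and fix_hrefs are made up for illustration.

```python
import re
from urllib.parse import quote

# Matches href="..." or href='...'. Unquoted values are NOT handled (assumption).
HREF_RE = re.compile(r'''(\bhref\s*=\s*)(["'])(.*?)\2''', re.IGNORECASE)

# Characters left alone: the URL delimiters plus "%", so that escapes
# already present in the page are not encoded a second time.
SAFE = ":/?#[]@!$&'()*+,;=%"


def fix_hrefs(html):
    """Percent-encode unsafe characters inside quoted href values."""

    def repl(match):
        prefix, quote_char, url = match.groups()
        return prefix + quote_char + quote(url.strip(), safe=SAFE) + quote_char

    return HREF_RE.sub(repl, html)


if __name__ == "__main__":
    # Made-up sample markup with a space inside the URL.
    print(fix_hrefs('<a href="http://example.com/a b.html">x</a>'))
```

The cleaned text can then be fed to the parser as usual. A regex like this is
fragile on badly broken markup, so it is a stopgap rather than a real fix.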

Any comments on what to do?

