cut strings and parse for images

Andreas Volz usenet-spam-trap at
Mon Dec 6 20:34:56 CET 2004


I used SGMLParser to extract all hrefs from an HTML file. Now I need to
trim some of the strings. For example:

Now I would like to cut the string so that only the domain and directory
are left. Expected result:

I know how to do this in bash, but not in Python. How can this be done?
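
One way this might be approached, as a minimal sketch: split the URL into its parts, drop the last path component, and put it back together. This uses Python 3's urllib.parse (in Python 2 the same functions live in the urlparse module). The function name strip_to_directory and the example.com URL are made up for illustration, and the sketch assumes that "domain and directory" means everything up to and including the last "/" of the path.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def strip_to_directory(url):
    """Return url without its last path component, query and fragment."""
    parts = urlsplit(url)
    # posixpath.dirname drops everything after the last "/" in the path
    directory = posixpath.dirname(parts.path)
    if not directory.endswith("/"):
        directory += "/"
    # rebuild the URL from scheme, host and the trimmed path only
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))

# hypothetical URL, not taken from the original post
print(strip_to_directory("http://www.example.com/pics/gallery/index.html"))
```

Going through a URL parser rather than cutting at the last "/" by hand keeps query strings and fragments from ending up in the result.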

The next problem is to extract not only hrefs but also images. An href
is easy:

<a href="install.php">Install</a>

But an image is a little harder:

<img class="bild" src="images/marine.jpg">

This is my current example code:

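from sgmllib import SGMLParser

leach_url = ""

class URLLister(SGMLParser):
	def reset(self):
		# SGMLParser.__init__ calls reset(), so the base class state
		# has to be initialised here as well
		SGMLParser.reset(self)
		self.urls = []

	def start_a(self, attrs):
		# attrs is a list of (name, value) pairs of the <a> tag
		href = [v for k, v in attrs if k=='href']
		if href:
			self.urls.extend(href)

if __name__ == "__main__":
	import urllib
	usock = urllib.urlopen(leach_url)
	parser = URLLister()
	# the page has to be fed to the parser, otherwise parser.urls stays empty
	parser.feed(usock.read())
	usock.close()
	parser.close()
	for url in parser.urls:
		print url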
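
The image case could probably be handled the same way as the href case: look at the attributes of the img tag and pick out src, ignoring others such as class. The following is only a sketch and swaps in html.parser.HTMLParser from the Python 3 standard library, since sgmllib is not available there; with sgmllib, a start_img or do_img method should play a similar role. The class name LinkAndImageLister and the images attribute are invented for this example.

```python
from html.parser import HTMLParser

class LinkAndImageLister(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = []     # href values of <a> tags
        self.images = []   # src values of <img> tags

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; attributes other
        # than href/src (for example class) are simply skipped
        for name, value in attrs:
            if tag == "a" and name == "href" and value:
                self.urls.append(value)
            elif tag == "img" and name == "src" and value:
                self.images.append(value)

parser = LinkAndImageLister()
# the two tags quoted earlier in this post
parser.feed('<a href="install.php">Install</a>'
            '<img class="bild" src="images/marine.jpg">')
parser.close()
print(parser.urls)
print(parser.images)
```

Keeping links and images in separate lists makes it easy to treat them differently later, for example to download only the images.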

Perhaps you have some tips on how to solve these problems?

