[Tutor] Website Retrieval Program

Daniel Watkins daniel at thewatkins.org.uk
Wed Aug 24 18:35:41 CEST 2005


I'm currently trying to write a script that will get all the files
necessary for a webpage to display correctly, followed by all the
intra-site pages and so forth, in order to retrieve one of the
many sites I have jumbled up on my webspace. After I started
writing it, someone introduced me to wget, but I'm continuing because
it seems like fun (and that statement is the first step on a slippery
slope :P).

My script thus far reads:
"""
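import re
import urllib.request

# Fetch the page and decode the bytes so the pattern is matched against text.
source = urllib.request.urlopen("http://www.aehof.org.uk/real_index2.html")
html = source.read().decode("utf-8", "replace")

# Non-greedy .*? stops each match at the first 'html"' that follows it.
matches = re.findall('src=".*?html"', html)
print(matches)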
"""

This returns the following:
"['src="nothing_left.html"', 'src="testindex.html"',
'src="nothing_right.html"']"

This is a good start (and it took me long enough! :P), but, ideally, the
re would strip out the 'src=' as well. Does anybody with more re-fu than
me know how I could do that?
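
One possible approach, sketched here rather than offered as the definitive answer: wrap the part of the pattern you want to keep in parentheses. When a pattern contains exactly one capturing group, re.findall returns only the text of that group, so the 'src=' and the quotes drop out. The sketch assumes Python 3; the sample markup, the SRC_HTML name and the find_html_sources helper are made up for illustration and are not taken from the real page.

```python
import re

# Hypothetical markup standing in for the fetched page; the real page's
# source is not reproduced here.
html = ('<frame src="nothing_left.html">'
        '<frame src="testindex.html">'
        '<frame src="nothing_right.html">')

# The parentheses form a capturing group. With exactly one group,
# re.findall returns the group's text instead of the whole match.
# [^"]* cannot run past a closing quote, so several src attributes
# on one line stay separate (a greedy .* would swallow them all).
SRC_HTML = re.compile(r'src="([^"]*html)"')


def find_html_sources(text):
    """Return the values of src attributes whose value ends in 'html'."""
    return SRC_HTML.findall(text)


print(find_html_sources(html))
```

The same helper could be fed the text read from urlopen in place of the sample string. For anything beyond simple pages, an HTML parser is usually more robust than a regular expression, since attributes may use single quotes or no quotes at all.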

Incidentally, feel free to use that page as an example. In addition, I
am aware that this will need to be adjusted and expanded later on, but
it's a start.

Thanks in advance,
Dan


