[Tutor] Website Retrieval Program

Luis N tegmine at gmail.com
Wed Aug 24 20:33:18 CEST 2005


On 8/24/05, Daniel Watkins <daniel at thewatkins.org.uk> wrote:
> I'm currently trying to write a script that will get all the files
> necessary for a webpage to display correctly, followed by all the
> intra-site pages and so forth, in order to retrieve one of the
> many sites I have got jumbled up on my webspace. After I started
> writing it, someone introduced me to wget, but I'm continuing this
> because it seems like fun (and that statement is the first step on a
> slippery slope :P).
> 
> My script thus far reads:
> """
> import re
> import urllib
> 
> source = urllib.urlopen("http://www.aehof.org.uk/real_index2.html")
> next = re.findall('src=".*html"',source.read())
> print next
> """
> 
> This returns the following:
> "['src="nothing_left.html"', 'src="testindex.html"',
> 'src="nothing_right.html"']"
> 
> This is a good start (and it took me long enough! :P), but, ideally, the
> re would strip out the 'src=' as well. Does anybody with more re-fu than
> me know how I could do that?
> 
> Incidentally, feel free to use that page as an example. In addition, I
> am aware that this will need to be adjusted and expanded later on, but
> it's a start.
> 
> Thanks in advance,
> Dan
> 

You may wish to have a look at the Beautiful Soup module,
http://www.crummy.com/software/BeautifulSoup/
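To answer the re question directly: if the pattern contains a capturing group, findall returns only the text inside the group, which strips the surrounding src=" and closing quote for you. A minimal sketch, using a hypothetical stand-in for the page's HTML rather than fetching it:

```python
import re

# Hypothetical stand-in for source.read() from the original script.
html = '''<frame src="nothing_left.html">
<frame src="testindex.html">
<frame src="nothing_right.html">'''

# The parentheses form a capturing group, so findall returns only the
# filename part.  The non-greedy .*? stops at the first closing quote,
# which avoids over-matching if two src attributes ever share a line.
pages = re.findall(r'src="(.*?\.html)"', html)
print(pages)  # ['nothing_left.html', 'testindex.html', 'nothing_right.html']
```

That said, a regex only sees the text, not the markup, so Beautiful Soup is the more robust choice once the script grows beyond this.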

Luis.

