[Tutor] Website Retrieval Program

Orri Ganel singingxduck at gmail.com
Wed Aug 24 18:51:30 CEST 2005


Daniel Watkins wrote:

>I'm currently trying to write a script that will get all the files
>necessary for a webpage to display correctly, followed by all the
>intra-site pages and so forth, in order to retrieve one of the
>many sites I have got jumbled up on my webspace. After I started
>writing it, someone introduced me to wget, but I'm continuing with this
>because it seems like fun (and that statement is the first step on a
>slippery slope :P).
>
>My script thus far reads:
>"""
>import re
>import urllib
>
>source = urllib.urlopen("http://www.aehof.org.uk/real_index2.html")
>next = re.findall('src=".*html"',source.read())
>print next
>"""
>
>This returns the following:
>"['src="nothing_left.html"', 'src="testindex.html"',
>'src="nothing_right.html"']"
>
>This is a good start (and it took me long enough! :P), but, ideally, the
>re would strip out the 'src=' as well. Does anybody with more re-fu than
>me know how I could do that?
>
>Incidentally, feel free to use that page as an example. In addition, I
>am aware that this will need to be adjusted and expanded later on, but
>it's a start.
>
>Thanks in advance,
>Dan
>
Well, you don't necessarily need re-fu to do that:

import re
import urllib

source = urllib.urlopen("http://www.aehof.org.uk/real_index2.html")
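# i[4:] drops the leading 'src=' (four characters) from each match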
next = [i[4:] for i in re.findall('src=".*html"', source.read())]
print next

gives you

['"nothing_left.html"', '"testindex.html"', '"nothing_right.html"']

And if you wanted it to strip out specifically the "src=" prefix rather 
than a fixed number of characters, I'm sure you could tailor it to do 
something with i[i.index(...):] instead.
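
That said, if you do want the re itself to do the stripping, putting a 
capturing group in the pattern is enough: when the pattern contains a 
single group, findall returns just the captured text. A rough sketch 
against the same page (the non-greedy .*? is only a precaution in case 
two src attributes ever land on one line):

import re
import urllib

source = urllib.urlopen("http://www.aehof.org.uk/real_index2.html")
# findall returns only what the parentheses capture, so neither src=
# nor the surrounding quotes show up in the result
next = re.findall(r'src="(.*?html)"', source.read())
print next

which should give you something like

['nothing_left.html', 'testindex.html', 'nothing_right.html']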

-- 
Email: singingxduck AT gmail DOT com
AIM: singingxduck
Programming Python for the fun of it.


