[Tutor] traversing page and then following the link
vince spicer
vinces1979 at gmail.com
Fri Jul 24 17:23:06 CEST 2009
There are many ways to parse HTML pages and retrieve data. I tend to use
lxml and XPath to simplify things, and urllib2 to pull down the data.
lxml is not part of the standard library but can be installed via
easy_install; the main benefit is the XPath support.
=====================================================================================
import urllib2
from lxml import html as HTML

useragent = ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.1) "
             "Gecko/20090716 Ubuntu/9.04 (jaunty) Shiretoko/3.5.1")
opener = urllib2.build_opener()
main_page = urllib2.Request(
    "http://en.wikipedia.org/wiki/Gallery_of_sovereign-state_flags")
#: wikipedia will return a 403 otherwise
main_page.headers["User-Agent"] = useragent
html = HTML.fromstring(opener.open(main_page).read())
images = html.xpath("//a[@class='image']/img")  #: select the img tag inside each image link

for img in images:
    url = img.attrib["src"].split("/")
    if "thumb" not in url:
        continue  #: skip images that are not thumbnails
    url.pop(url.index("thumb"))  #: remove reference to thumb folder
    local = open(url[-2], "wb")  #: open local file, named after the svg
    url = "/".join(url[:-1])  #: cleanup the .png reference
    print "Downloading %s" % url
    imgreq = urllib2.Request(url)
    imgreq.headers["User-Agent"] = useragent
    #: read the remote svg file into local file
    local.write(opener.open(imgreq).read())
    local.close()
============================================================================================
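The URL handling in the loop above can be tried on its own, without touching the network. This is only a minimal sketch, assuming the thumbnail layout the script relies on (.../thumb/.../<Name>.svg/<width>px-<Name>.svg.png). The function name original_from_thumb and the sample URL are made up for illustration and are not part of any library.

```python
def original_from_thumb(src):
    """Map a thumbnail URL to (original file URL, local file name).

    Assumed layout (not guaranteed by Wikipedia):
    .../thumb/<a>/<ab>/<Name>.svg/<width>px-<Name>.svg.png
    """
    parts = src.split("/")
    if "thumb" not in parts:
        return None  # not a thumbnail URL; the caller decides what to do
    parts.remove("thumb")  # drop the "thumb" path segment
    filename = parts[-2]  # the segment before the rendered .png is the svg name
    return "/".join(parts[:-1]), filename  # strip the trailing .png segment


# Hypothetical example URL in the assumed layout
sample = ("http://upload.wikimedia.org/wikipedia/commons/thumb/"
          "a/ab/Flag_of_Example.svg/120px-Flag_of_Example.svg.png")
result = original_from_thumb(sample)
print(result)
```

Returning None for a non-thumbnail URL is just one possible choice; it lets the calling loop skip such images instead of failing on them.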
On Fri, Jul 24, 2009 at 8:48 AM, <davidwilson at safe-mail.net> wrote:
> Hello,
> I would like to download all the flags from the
> http://en.wikipedia.org/wiki/Gallery_of_sovereign-state_flags so that I
> can create a flags sprite from this.
>
> The flags seem to follow some order in that all the svg files are in the
> following pattern:
>
> http://en.wikipedia.org/wiki/File:Flag_of_*.svg and then on this page
> there is the link of the file.
>
> I have looked at using Twill to follow each link and record the actual URL,
> but can someone point me at a simpler solution?
>
> Dave