[Tutor] traversing page and then following the link

vince spicer vinces1979 at gmail.com
Fri Jul 24 17:23:06 CEST 2009


There are many ways to parse HTML pages and retrieve data. I tend to use
lxml and XPath to simplify things, and urllib2 to pull down the data.

lxml is not a core library, but it can be installed via easy_install. Its main
benefit is the XPath support.
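
To show what an XPath query such as //a[@class='image']/img selects, here is
a minimal sketch. It is an illustration only: it is written for Python 3 and
uses the standard library's xml.etree.ElementTree as a stand-in for lxml, so
it runs without installing anything. ElementTree supports only a small subset
of XPath and only parses well-formed markup, so real pages still call for
lxml. The SAMPLE markup and the image_sources helper are made up for this
example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; real pages are messier.
SAMPLE = """<div>
  <a class="image" href="/wiki/File:Flag_of_A.svg"><img src="a.png"/></a>
  <a class="external" href="http://example.org"><img src="b.png"/></a>
  <a class="image" href="/wiki/File:Flag_of_B.svg"><img src="c.png"/></a>
</div>"""


def image_sources(markup):
    """Return the src of every <img> that is a direct child of <a class="image">."""
    root = ET.fromstring(markup)
    # ElementTree needs the leading "." to search relative to the root element;
    # lxml accepts the absolute form "//a[@class='image']/img" as well.
    return [img.get("src") for img in root.findall(".//a[@class='image']/img")]


print(image_sources(SAMPLE))
```

The link with class "external" is skipped, so only the two images inside
"image" links are returned.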

=====================================================================================
import urllib2
from lxml import html as HTML

useragent = ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.1) "
             "Gecko/20090716 Ubuntu/9.04 (jaunty) Shiretoko/3.5.1")

opener = urllib2.build_opener()

main_page = urllib2.Request(
    "http://en.wikipedia.org/wiki/Gallery_of_sovereign-state_flags")
#: wikipedia will return a 403 without a User-Agent header
main_page.headers["User-Agent"] = useragent

html = HTML.fromstring(opener.open(main_page).read())

images = html.xpath("//a[@class='image']/img") #: parse all image tags

for img in images:
    url = img.attrib["src"].split("/")
    url.pop(url.index("thumb")) #: remove reference to thumb folder

    local = open(url[-2], "wb") #: open local file

    url = "/".join(url[:-1]) #: drop the trailing thumbnail .png name

    print "Downloading %s" % url

    imgreq = urllib2.Request(url)
    imgreq.headers["User-Agent"] = useragent

    #: read the remote svg file into the local file
    local.write(opener.open(imgreq).read())
    local.close()

============================================================================================
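
The URL handling inside the loop can be isolated and checked without touching
the network. The following is a sketch for Python 3, not part of the script
above: the function name original_from_thumb and the example.org URL are made
up for illustration, and it assumes thumbnail URLs of the form
.../thumb/<dirs>/<name>/<width>px-<name>.png, which is what the script relies
on. Unlike the script, it passes through URLs that contain no "thumb"
component instead of raising ValueError.

```python
def original_from_thumb(src):
    """Map a thumbnail URL to (original_url, local_filename).

    Assumes the layout .../thumb/<dirs>/<name>/<width>px-<name>.png.
    """
    parts = src.split("/")
    if "thumb" not in parts:
        # Not a thumbnail URL: treat it as the original file already.
        return src, parts[-1]
    parts.remove("thumb")      # drop the thumb folder component
    filename = parts[-2]       # e.g. "Flag_of_Example.svg"
    # Everything except the trailing "<width>px-<name>.png" is the original URL.
    return "/".join(parts[:-1]), filename


# Made-up URL that follows the assumed layout.
thumb = ("http://upload.example.org/commons/thumb/a/ab/"
         "Flag_of_Example.svg/100px-Flag_of_Example.svg.png")
print(original_from_thumb(thumb))
```

Computing the file name before opening anything also makes it possible to
open the local file only after the download has succeeded, which avoids
leaving empty files behind when a request fails.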

On Fri, Jul 24, 2009 at 8:48 AM, <davidwilson at safe-mail.net> wrote:

> Hello,
> I would like to download all the flags from
> http://en.wikipedia.org/wiki/Gallery_of_sovereign-state_flags so that I
> can create a flags sprite from them.
>
> The flags seem to follow some order in that all the svg files are in the
> following pattern:
>
> http://en.wikipedia.org/wiki/File:Flag_of_*.svg and each of these pages
> contains the link to the actual file.
>
> I have looked at using Twill to follow each link and record the actual URL,
> but can someone point me at a simpler solution?
>
> Dave