[Tutor] can I walk or glob a website?
Albert-Jan Roskam
fomcl at yahoo.com
Wed May 18 19:32:36 CEST 2011
Hi Steven,
From: Steven D'Aprano <steve at pearwood.info>
To: tutor at python.org
Sent: Wed, May 18, 2011 1:13:17 PM
Subject: Re: [Tutor] can I walk or glob a website?
On Wed, 18 May 2011 07:06:07 pm Albert-Jan Roskam wrote:
> Hello,
>
> How can I walk (as in os.walk) or glob a website?
If you're on Linux, use wget or curl.
===> Thanks for your reply. I tried wget, which seems to be a very handy tool,
but it doesn't work on this particular site. I ran:
wget -e robots=off -r -nc --no-parent -l6 -A.pdf 'http://www.landelijkregisterkinderopvang.nl/'
(the quotes are there because I originally used a deeper link that contains
ampersands). I also tested the same command on python.org, where it does work.
Adding -e robots=off didn't help either. Do you think this could be a
protection put in place by the site's administrator?
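One thing worth checking: some servers refuse requests whose User-Agent doesn't
look like a browser (wget also has a --user-agent option for this). A quick,
hypothetical sketch using the stdlib urllib to test that idea; the header string
and function name are just examples:

```python
# Sketch: build a request carrying a browser-like User-Agent, in case
# the server rejects wget's default one. Header value is illustrative.
import urllib.request

def make_request(url, user_agent="Mozilla/5.0 (compatible; test)"):
    """Return a Request object with a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

req = make_request("http://www.landelijkregisterkinderopvang.nl/")
print(req.get_header("User-agent"))  # the header as it will be sent
# urllib.request.urlopen(req) would then fetch the page; if that works
# while plain wget fails, the server is filtering on User-Agent.
```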
If you're on Mac, you can probably install them using MacPorts.
If you're on Windows, you have my sympathies.
*wink*
> I want to download
> all the pdfs from a website (using urllib.urlretrieve),
This first part is essentially duplicating wget or curl. The basic
algorithm is:
- download a web page
- analyze that page for links
(such as <a href=...> but possibly also others)
- decide whether you should follow each link and download that page
- repeat until there's nothing left to download, the website blocks
your IP address, or you've got everything you want
Except that wget and curl already do 90% of the work.
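The "analyze that page for links" step can be sketched in a few lines with only
the stdlib; the class and function names below are my own, and a real crawler
would add politeness delays and duplicate tracking:

```python
# Minimal sketch of the link-extraction step of the algorithm above,
# using only the stdlib. Names are illustrative.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href target of every <a> tag, resolved against base_url."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links

page = '<a href="report.pdf">report</a> <a href="/sub/page.html">more</a>'
print(extract_links(page, "http://example.com/docs/"))
# -> ['http://example.com/docs/report.pdf', 'http://example.com/sub/page.html']
```

A crawler would then queue the .html links for further fetching and hand the
.pdf links to urllib's urlretrieve, repeating until the queue is empty.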
If the webpage requires JavaScript to make things work, wget or curl
can't help. I believe there is a Python library called Mechanize to
help with that. For dealing with real-world HTML (also known
as "broken" or "completely f***ed" HTML, please excuse the
self-censorship), the library BeautifulSoup may be useful.
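For instance, BeautifulSoup (a third-party package; the modern version installs
as beautifulsoup4) will still recover the links from tag soup that would trip a
strict parser. A small sketch, with made-up input:

```python
# Sketch: BeautifulSoup recovering links from malformed HTML
# (unclosed tags, an unquoted attribute). Requires the third-party
# beautifulsoup4 package.
from bs4 import BeautifulSoup

broken = '<p>Reports<a href="a.pdf">first<a href=b.pdf>second'
soup = BeautifulSoup(broken, "html.parser")
hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)  # -> ['a.pdf', 'b.pdf']
```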
Before doing any mass downloading, please read this:
http://lethain.com/an-introduction-to-compassionate-screenscraping/
> extract
> certain figures (using pypdf- is this flexible enough?) and make some
> statistics/graphs from those figures (using rpy and R). I forgot what
> the process of 'automatically downloading' is called again, something
> that sounds like 'whacking' (??)
Sometimes called screen or web scraping, recursive downloading, or
copyright infringement *wink*
http://en.wikipedia.org/wiki/Web_scraping
--
Steven D'Aprano
_______________________________________________
Tutor maillist - Tutor at python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor