[Tutor] can I walk or glob a website?

Albert-Jan Roskam fomcl at yahoo.com
Wed May 18 19:32:36 CEST 2011


Hi Steven,


From: Steven D'Aprano <steve at pearwood.info>

To: tutor at python.org
Sent: Wed, May 18, 2011 1:13:17 PM
Subject: Re: [Tutor] can I walk or glob a website?

On Wed, 18 May 2011 07:06:07 pm Albert-Jan Roskam wrote:
> Hello,
>
> How can I walk (as in os.walk) or glob a website? 

If you're on Linux, use wget or curl.

===> Thanks for your reply. I tried wget, which seems to be a very handy tool. 
However, it doesn't work on this particular site. I tried wget -e robots=off -r 
-nc --no-parent -l6 -A.pdf 'http://www.landelijkregisterkinderopvang.nl/' (the 
quotes are there because I originally used a deeper link that contains 
ampersands). I also tested the same command on python.org, where it does work. 
Adding -e robots=off didn't help either. Do you think the site's administrator 
may have put some protection in place against this?
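
As a sanity check, I also tried fetching the front page from Python while 
pretending to be a browser, in case the server refuses unfamiliar clients. 
A minimal Python 2 sketch (the User-Agent value is just a guess at what the 
server might accept):

    import urllib2

    # Does the server respond when the request looks like a browser?
    url = 'http://www.landelijkregisterkinderopvang.nl/'
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        html = urllib2.urlopen(req).read()
        print 'got %d bytes' % len(html)
    except urllib2.HTTPError, e:
        print 'server refused: HTTP %d' % e.code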

If you're on Mac, you can probably install them using MacPorts.

If you're on Windows, you have my sympathies.

*wink*


> I want to download 
> all the pdfs from a website (using urllib.urlretrieve), 

This first part is essentially duplicating wget or curl. The basic 
algorithm is:

- download a web page
- analyze that page for links 
  (such as <a href=...>, but possibly also others)
- decide whether you should follow each link and download that page
- repeat until there's nothing left to download, the website blocks 
  your IP address, or you've got everything you want

except wget and curl already do 90% of the work.
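
If you do want to roll your own, the loop looks something like this minimal 
Python 2 sketch (standard library only; the regex link extraction is 
deliberately naive, and the URL and depth limit are placeholders):

    import re
    import urllib
    import urllib2
    import urlparse

    seen = set()

    def crawl(url, depth):
        # Download url, save any PDFs, follow links on the same site.
        if depth < 0 or url in seen:
            return
        seen.add(url)
        try:
            html = urllib2.urlopen(url).read()
        except IOError:
            return  # unreachable page: give up on this branch
        for href in re.findall(r'href="([^"]+)"', html):
            link = urlparse.urljoin(url, href)
            if link.lower().endswith('.pdf'):
                urllib.urlretrieve(link, link.split('/')[-1])
            elif urlparse.urlparse(link).netloc == urlparse.urlparse(url).netloc:
                crawl(link, depth - 1)

    crawl('http://www.example.com/', 2)  # placeholder URL, depth limit 2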

If the webpage requires Javascript to make things work, wget or curl 
can't help. I believe there is a Python library called Mechanize to 
help with that. For dealing with real-world HTML (also known 
as "broken" or "completely f***ed" HTML, please excuse the 
self-censorship), the library BeautifulSoup may be useful.
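
For example, with BeautifulSoup 3, extracting PDF links from deliberately 
broken markup (a sketch, assuming the HTML is already in a string):

    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

    # Malformed on purpose; the parser copes with unclosed tags.
    html = '<a href="report.pdf">Report<p><a href="/about">About'
    soup = BeautifulSoup(html)
    pdf_links = [a['href'] for a in soup.findAll('a', href=True)
                 if a['href'].lower().endswith('.pdf')]
    print pdf_links  # ['report.pdf']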

Before doing any mass downloading, please read this:

http://lethain.com/an-introduction-to-compassionate-screenscraping/



> extract 
> certain figures (using pypdf- is this flexible enough?) and make some
> statistics/graphs from those figures (using rpy and R). I forgot what
> the process of 'automatically downloading' is called again, something
> that sounds like 'whacking' (??)

Sometimes called screen scraping or web scraping, recursive downloading, or 
copyright infringement *wink*

http://en.wikipedia.org/wiki/Web_scraping
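
As for the pyPdf part of your question: it can pull raw text out of a PDF, 
though extractText() is hit-and-miss on complex layouts, so test it on your 
actual files before building statistics on top of it. A minimal sketch (the 
filename is hypothetical):

    from pyPdf import PdfFileReader

    reader = PdfFileReader(open('some_report.pdf', 'rb'))  # hypothetical file
    for i in range(reader.getNumPages()):
        text = reader.getPage(i).extractText()  # quality varies a lot by PDF
        print 'page %d: %d characters of text' % (i, len(text))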



-- 
Steven D'Aprano

