[Tutor] Accessing a Website

Steven D'Aprano steve at pearwood.info
Fri Jul 13 02:27:23 CEST 2012


Fred G wrote:

> With the exception of step 6, I'm not quite sure how to do this in Python.
>  Is it very complicated to write a script that logs onto a website that
> requires a user name and password that I have, and then repeatedly enters
> names and gets their associated id's that we want?


Python comes with libraries for downloading web resources, including web 
pages. But if you have to interact with the web page, such as entering names 
into a search field, your best bet is the third-party library mechanize. I 
have never used it myself, but I have heard nothing but good things about it. 
(There is a sketch of what the code might look like after the links below.)

http://pypi.python.org/pypi/mechanize/

http://www.ibm.com/developerworks/linux/library/l-python-mechanize-beautiful-soup/index.html
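
To give a flavour, here is a minimal sketch of logging in and submitting a 
search form with mechanize (Python 2, which mechanize targets). The URLs, form 
positions and field names are assumptions; you would need to inspect the 
actual site's HTML to find the real ones.

    import mechanize

    br = mechanize.Browser()
    br.open("http://example.com/login")  # hypothetical login page

    # Fill in the first form on the page. The field names "username"
    # and "password" are guesses; check the site's HTML for the real ones.
    br.select_form(nr=0)
    br["username"] = "my_user"
    br["password"] = "my_secret"
    br.submit()

    # Now submit a (hypothetical) search form and grab the result page.
    br.open("http://example.com/search")
    br.select_form(nr=0)
    br["name"] = "Smith, John"
    response = br.submit()
    html = response.read()  # raw HTML, ready for BeautifulSoup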


> I used to work at a
> cancer lab where we decided we couldn't do this kind of thing to search
> PubMed, and that a human would be more accurate even though our criteria
> was simply (is there survival data?).  I don't think that this has to be
> the case here, but would greatly appreciate any guidance.


In general, web-scraping is fraught with problems. Web sites are written by 
ignorant code-monkeys who can barely spell HTML, or worse, by 
too-clever-by-half web designers who write subtly broken code that only works 
with Internet Explorer (and, if you are lucky, Firefox). Or they stick 
everything in Javascript, or worse, Flash.

And often the web server tries to prevent automated tools from fetching 
information. Or there may be legal barriers, where something which is 
perfectly legal if *you* do it becomes (allegedly) illegal if an automated 
script does it.

So I can perfectly understand why a conservative, risk-averse university 
might prefer to have a human being mechanically fetch the data.

And yet, with some work it is possible to code around nearly all of these 
issues. Using tools like mechanize and BeautifulSoup, faking the user-agent 
string, and a few other techniques, most non-Flash, non-Javascript sites can 
be successfully scraped. Even the legal issue can be coded around by adding 
some human interaction to the script, so that it is not an *automated* 
script, while still keeping most of the benefits of automated scraping.
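
Faking the user-agent string with mechanize, for instance, is essentially a 
one-liner. This is only a sketch; the browser string shown is just an example 
of a plausible desktop browser:

    import mechanize

    br = mechanize.Browser()
    # Replace mechanize's default User-Agent header with one that looks
    # like an ordinary desktop browser.
    br.addheaders = [("User-Agent",
                      "Mozilla/5.0 (Windows NT 6.1; rv:13.0) "
                      "Gecko/20100101 Firefox/13.0")]
    br.open("http://example.com/")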

Don't abuse the privilege:

- obey robots.txt (the sketch after this list shows one way to check it)
- obey the site's terms and conditions
- obey copyright law
- make a temporary cache of pages you need to re-visit[1]
- give real human visitors priority
- limit your download rate to something reasonable
- pause between requests so you aren't hitting the server at an
   unreasonable rate
- in general, don't be a dick and disrupt the normal working of the
   server or website with your script.
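
Several of these courtesies can be built straight into your script. Here is a 
sketch of a polite fetch function in Python 2, using the standard library's 
robotparser and urllib2 modules. The site, the user-agent string and the 
five-second delay are arbitrary assumptions; tune them for the site you are 
actually visiting.

    import robotparser
    import time
    import urllib2

    USER_AGENT = "my-research-script/0.1"  # identify yourself honestly
    DELAY = 5.0  # seconds between requests; pick something reasonable

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")  # hypothetical site
    rp.read()

    _cache = {}  # temporary cache of pages already fetched
    _last_fetch = 0.0

    def polite_fetch(url):
        """Fetch url, obeying robots.txt, caching and rate-limiting."""
        global _last_fetch
        if url in _cache:
            return _cache[url]
        if not rp.can_fetch(USER_AGENT, url):
            raise ValueError("robots.txt forbids fetching %s" % url)
        # Pause so we don't hammer the server with rapid requests.
        wait = DELAY - (time.time() - _last_fetch)
        if wait > 0:
            time.sleep(wait)
        request = urllib2.Request(url, headers={"User-Agent": USER_AGENT})
        page = urllib2.urlopen(request).read()
        _last_fetch = time.time()
        _cache[url] = page
        return page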




[1]  I am aware of the irony that this is theoretically forbidden by 
copyright. Nevertheless, it is the right thing to do, both technically and 
ethically.



-- 
Steven


