[Tutor] Accessing a Website
Steven D'Aprano
steve at pearwood.info
Fri Jul 13 02:27:23 CEST 2012
Fred G wrote:
> With the exception of step 6, I'm not quite sure how to do this in Python.
> Is it very complicated to write a script that logs onto a website that
> requires a user name and password that I have, and then repeatedly enters
> names and gets their associated id's that we want?
Python comes with some libraries for downloading web resources, including web
pages. But if you have to interact with the web page, such as entering
names into a search field, your best bet is the third-party library mechanize.
I have never used it, but I have never heard anything but good things about it.
http://pypi.python.org/pypi/mechanize/
http://www.ibm.com/developerworks/linux/library/l-python-mechanize-beautiful-soup/index.html
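For the plain download-and-submit part, the standard library alone will get
you a fair way. Here is a rough sketch (Python 3 syntax) of building a login
POST request; the URL and the form field names are invented for illustration,
you would have to inspect the real site's login form to find the actual ones:

```python
import urllib.parse
import urllib.request

def build_login_request(url, username, password):
    """Build a POST request that submits a login form.

    The field names "username" and "password" are hypothetical;
    check the site's HTML <form> for the real ones.
    """
    data = urllib.parse.urlencode({
        "username": username,
        "password": password,
    }).encode("utf-8")
    return urllib.request.Request(url, data=data)

req = build_login_request("http://example.com/login", "fred", "secret")
print(req.get_method())  # POST (a Request with data defaults to POST)
# Actually sending it would be:  urllib.request.urlopen(req)
```

mechanize wraps this sort of thing up for you, and also keeps cookies
between requests, which you will almost certainly need for a login session.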
> I used to work at a
> cancer lab where we decided we couldn't do this kind of thing to search
> PubMed, and that a human would be more accurate even though our criteria
> was simply (is there survival data?). I don't think that this has to be
> the case here, but would greatly appreciate any guidance.
In general, web-scraping is fraught with problems. Web sites are written by
ignorant code-monkeys who can barely spell HTML, or worse, too-clever-by-far
web designers who write too-clever, subtly broken code that only works with
Internet Explorer (and if you are lucky, Firefox). Or they stick everything in
Javascript, or worse, Flash.
And often the web server tries to prevent automated tools from fetching
information. Or there may be legal barriers, where something which is
perfectly legal if *you* do it becomes (allegedly) illegal if an automated
script does it.
So I can perfectly understand why a conservative, risk-averse university
might prefer to have a human being mechanically fetch the data.
And yet, with work it is possible to code around nearly all these issues.
Using tools like mechanize and BeautifulSoup, faking the user-agent string,
and a few other techniques, most non-Flash non-Javascript sites can be
successfully web-scraped. Even the legal issue can be coded around by adding
some human interaction to the script, so that it is not an *automated* script,
while still keeping most of the benefits of automated scraping.
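Faking the user-agent string, for what it's worth, is a one-liner with the
standard library; the header value below is just an example of a
browser-like string, not a recommendation:

```python
import urllib.request

# Some servers refuse the default Python user-agent, so present
# a browser-like one instead (example string only).
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
req = urllib.request.Request("http://example.com/data", headers=headers)
print(req.get_header("User-agent"))  # Mozilla/5.0 (X11; Linux x86_64)
```

(urllib normalises header names to "User-agent" capitalisation internally,
hence the odd-looking lookup key.)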
Don't abuse the privilege:
- obey robots.txt
- obey the site's terms and conditions
- obey copyright law
- make a temporary cache of pages you need to re-visit[1]
- give real human visitors priority
- limit your download rate to something reasonable
- pause between requests so you aren't hitting the server at an
unreasonable rate
- in general, don't be a dick and disrupt the normal working of the
server or website with your script.
[1] I am aware of the irony that this is theoretically forbidden by
copyright. Nevertheless, it is the right thing to do, both technically and
ethically.
--
Steven