[Tutor] Accessing the Web using Python
Danny Yoo
dyoo@hkn.eecs.berkeley.edu
Tue Feb 11 22:05:03 2003
On Fri, 23 Aug 2002, Henry Steigerwaldt wrote:
> import urllib
>
> fwcURL = "http://isl715.nws.noaa.gov/tdl/forecast/fwc.txt"
>
> try:
>     print "Going to Web for data"
>     fwcall = urllib.urlopen(fwcURL).read()
>     print "Successful"
>     print "Will now print all of the data to screen"
>     print "fwcall = ", fwcall
> except:
>     print "Could not obtain data from Web"
> ______________________________________________
> Using this code the previous day, I had absolutely no problem
> getting the data. However yesterday using the same code,
> most of the time this site could not be accessed at all.
[some text cut]
> I also noticed that a few times I would be able to access the site (i.e.
> the "Successful" would print to the screen), but each time absolutely
> NOTHING would be stored in the "fwcall" variable, unlike successful
> times when all the text information WAS stored in the variable.
Hi Henry,
We may need some more information; at the moment, the bare 'except:' clause
is swallowing the details that exceptions can provide. Let's enable some
more diagnostics. Can you change the except block to something like:
###
except:
    print "Could not obtain data from Web"
    traceback.print_exc()
###
You'll probably need to import the 'traceback' module for this. The
additional line, that "traceback.print_exc()", will print out more
information about the exception itself, and should give us insight into
what exactly is causing the magic to fizzle.
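For illustration, here's a minimal sketch of what that pattern looks like.
The dictionary lookup is just a stand-in for the urlopen() call, forcing an
error on purpose so we can see what the traceback reports:

```python
import traceback

try:
    # Stand-in for the urlopen() call: deliberately raise an error.
    data = {}["missing-key"]
except:
    print("Could not obtain data from Web")
    traceback.print_exc()   # writes the full traceback to stderr
```

Instead of just "Could not obtain data from Web", we'd also see the
exception type (here, a KeyError) and the exact line that raised it, which
is precisely the information the bare except block was hiding.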
> I just tried this same code tonight and once again it works great! I am
> really puzzled by all this. When one writes a program to access the Web,
> as long as the site accessed is not "down," one should anticipate always
> being able to get the data.
It actually depends on the service that the web site provides! For
example, the National Center for Biotechnology Information (NCBI) provides
a set of valuable online programs and services for biologists:
http://www.ncbi.nlm.nih.gov/
But, despite the electronic nature of NCBI, there is a kind of scarcity
involved here: namely, they need to maintain a service that's available to
scientists in a timely fashion, and some of the services they provide are
computationally very expensive. What to do?
NCBI has a cap, a kind of rate limiter, that limits how many requests they
handle from a single computer at a time. That is, NCBI will block web
requests of anyone who tries to abuse their public resource. As an
example, here's what their guidelines dictate:
"""
Do not overload NCBI's systems. Users intending to send numerous
queries and/or retrieve large numbers of records from Entrez should
comply with the following:
* Run retrieval scripts on weekends or between 9 PM and 5 AM ET
weekdays for any series of more than 100 requests.
* Make no more than one request every 3 seconds.
"""
And they are serious. I accidentally ran a program once that hammered their
systems. It is not a Good Thing when your computer is blacklisted from a
national public resource. *cough*
But that's NCBI; I don't know if the National Weather Service applies a
similar rate-limiter on their services. So let's see what the
traceback.print_exc() gives us in your program above, and we'll work from
there.
Good luck!