[Tutor] Accessing the Web using Python

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Tue Feb 11 22:05:03 2003


On Fri, 23 Aug 2002, Henry Steigerwaldt wrote:


> import urllib
>
> fwcURL = "http://isl715.nws.noaa.gov/tdl/forecast/fwc.txt"
>
> try:
>    print "Going to Web for data"
>    fwcall = urllib.urlopen(fwcURL).read()
>    print "Successful"
>    print "Will now print all of the data to screen"
>    print "fwcall = ", fwcall
> except:
>    print "Could not obtain data from Web"
> ______________________________________________
> Using this code the previous day, I had absolutely no problem
> getting the data. However yesterday using the same code,
> most of the time this site could not be accessed at all.

[some text cut]

> I also noticed that a few times I would be able to access the site (i.e.
> the "Successful" would print to the screen), but each time absolutely
> NOTHING would be stored in the "fwcall" variable, unlike successful
> times when all the text information WAS stored in the variable.



Hi Henry,


We may need some more information; at the moment, the bare "except:"
clause is hiding the details that exceptions can provide.  Let's enable
some more diagnostics.  Can you change the except block to something like:

###
except:
    print "Could not obtain data from Web"
    traceback.print_exc()
###

You'll also need to import the 'traceback' module for this.  The
additional line, the "traceback.print_exc()" call, will print out more
information about the exception itself, and should give us insight into
what exactly is causing the magic to fizzle.
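
For reference, here's your whole script with that change folded in
(same URL and variable names as before, plus the import at the top):

###
import urllib
import traceback

fwcURL = "http://isl715.nws.noaa.gov/tdl/forecast/fwc.txt"

try:
    print "Going to Web for data"
    fwcall = urllib.urlopen(fwcURL).read()
    print "Successful"
    print "Will now print all of the data to screen"
    print "fwcall = ", fwcall
except:
    print "Could not obtain data from Web"
    traceback.print_exc()    # show the real error, not just our message
###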




> I just tried this same code tonight and once again it works great! I am
> really puzzled by all this. When one writes a program to access the Web,
> as long as the site accessed is not "down," one should anticipate always
> being able to get the data.


It actually depends on the service that the web site provides!  For
example, the National Center for Biotechnology Information (NCBI) provides
a set of valuable online programs and services for biologists:

    http://www.ncbi.nlm.nih.gov/


But, despite the electronic nature of NCBI, there is a kind of scarcity
involved here: namely, they need to maintain a service that's available to
scientists in a timely fashion, and some of the services they provide are
computationally very expensive.  What to do?


NCBI has a cap, a kind of rate limiter, on how many requests they will
handle from a single computer at a time.  That is, NCBI will block web
requests from anyone who tries to abuse their public resource.  As an
example, here's what their guidelines dictate:


"""
    Do not overload NCBI's systems. Users intending to send numerous
    queries and/or retrieve large numbers of records from Entrez should
    comply with the following:

    * Run retrieval scripts on weekends or between 9 PM and 5 AM ET
      weekdays for any series of more than 100 requests.

    * Make no more than one request every 3 seconds.
"""


And they are serious.  I once accidentally ran a program that hammered
their systems.  It is not a Good Thing when your computer is blacklisted
from a national public resource.  *cough*
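
If you ever do write a script that pulls down a series of pages, a short
pause between requests keeps you comfortably inside a limit like theirs.
Here's a minimal sketch; the URL list here is made up for illustration:

###
import urllib
import time

# hypothetical pages to fetch -- substitute the real ones
urls = ["http://www.example.com/page1.txt",
        "http://www.example.com/page2.txt"]

for url in urls:
    data = urllib.urlopen(url).read()
    # ... process 'data' here ...
    time.sleep(3)    # no more than one request every 3 seconds
###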



But that's NCBI; I don't know if the National Weather Service applies a
similar rate limiter to its services.  So let's see what
traceback.print_exc() gives us in your program above, and we'll work from
there.
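
And if the traceback shows that the server is simply flaky, one way to
cope is to retry a few times with a pause in between.  A rough sketch,
assuming a handful of attempts and a fixed delay are good enough:

###
import urllib
import time
import traceback

def fetch(url, attempts=3, delay=5):
    """Try to read 'url', retrying a few times before giving up."""
    for i in range(attempts):
        try:
            data = urllib.urlopen(url).read()
            if data:                 # guard against an empty response
                return data
        except:
            traceback.print_exc()
        time.sleep(delay)            # pause before trying again
    return None
###

The "if data:" check also covers the odd case you saw where urlopen
succeeds but nothing ends up in the variable.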



Good luck!