[Tutor] Strategy to read a redirecting html page

Thu Jun 2 22:06:05 CEST 2011

> When you hit the page and you get an HTTP redirect code back (say,
> 302), you will need to make another call to the URL specified in the
> "Location" parameter in the response headers. Then you retrieve that
> new page and you can check you got an acceptable HTTP response code
> (such as 200) and read the page's body (or whatever you want to do
> with it). Otherwise, keep looping until you get an expected HTTP
> response code.
>
> Note: you may get stuck in an infinite loop if two URLs redirect to each other.
>
> You might want to take a look at the higher level httplib module:
> http://docs.python.org/library/httplib.html
>
> Although I don't think it can automatically follow redirects for you.
> You'll have to implement the loop yourself.
>
> If you can rely on 3rd party packages (not part of the standard Python
> library), take a look at httplib2:
> https://httplib2.googlecode.com/hg/doc/html/libhttplib2.html
>
> This one can follow redirects.
>
> HTH,
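
The loop described in the quoted message could be sketched roughly as
below. This is only an illustration under stated assumptions: `fetch` is
a hypothetical callable (not part of any library) that returns
`(status, headers, body)` for a URL without following redirects itself,
and `follow_redirects` and `max_hops` are names made up for this sketch.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_redirects(fetch, url, max_hops=10):
    """Follow redirects by hand, guarding against redirect loops.

    `fetch` is a hypothetical callable returning (status, headers, body)
    for a URL; it must not follow redirects on its own.
    """
    seen = set()
    for _ in range(max_hops + 1):
        if url in seen:
            # Two (or more) URLs redirect to each other.
            raise RuntimeError("redirect loop detected at %s" % url)
        seen.add(url)
        status, headers, body = fetch(url)
        if status not in REDIRECT_CODES:
            # Not a redirect: hand back the final URL, status and body.
            return url, status, body
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, headers["Location"])
    raise RuntimeError("too many redirects")
```

A real `fetch` could be built on httplib (http.client in Python 3);
keeping it as a parameter just makes the loop easy to try with canned
responses.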

Sorry for bringing up an old topic like this, but writing longer
messages on a phone is just not something that I want to do.

Python already ships with urllib2 (urllib.request in Python 3), which
follows redirects automatically, so I don't see why you'd need a
3rd-party module to deal with it. When it gets a 301 status code from
the server, urllib2 searches through its handlers and calls the
http_error_301 method, which reads the Location: header and follows
that address (302, 303 and 307 are handled by the same method). The
behaviour is defined in HTTPRedirectHandler, which can be overridden
if necessary:

>>> import urllib.request
>>> help(urllib.request.HTTPRedirectHandler)
Help on class HTTPRedirectHandler in module urllib.request:

class HTTPRedirectHandler(BaseHandler)
 |  Method resolution order:
 |      HTTPRedirectHandler
 |      BaseHandler
 |      builtins.object
 |
 |  Methods defined here:
 |
 |  http_error_301 = http_error_302(self, req, fp, code, msg, headers)
 |
 |  http_error_302(self, req, fp, code, msg, headers)
 |      # Implementation note: To avoid the server sending us into an
 |      # infinite loop, the request object needs to track what URLs we
 |      # have already seen.  Do this by adding a handler-specific
 |      # attribute to the Request object.
 |
 |  http_error_303 = http_error_302(self, req, fp, code, msg, headers)
 |
 |  http_error_307 = http_error_302(self, req, fp, code, msg, headers)
 |
 |  redirect_request(self, req, fp, code, msg, headers, newurl)
 |      Return a Request or None in response to a redirect.
 |
 |      This is called by the http_error_30x methods when a
 |      redirection response is received.  If a redirection should
 |      take place, return a new Request to allow http_error_30x to
 |      perform the redirect.  Otherwise, raise HTTPError if no-one
 |      else should try to handle this url.  Return None if you can't
 |      but another Handler might.
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |
 |  inf_msg = 'The HTTP server returned a redirect error that w...n infini...
 |
 |  max_redirections = 10
 |
 |  max_repeats = 4
 |
 |  ----------------------------------------------------------------------
 |  Methods inherited from BaseHandler:
 |
 |  __lt__(self, other)
 |
 |  add_parent(self, parent)
 |
 |  close(self)
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from BaseHandler:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes inherited from BaseHandler:
 |
 |  handler_order = 500
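
As a rough illustration of overriding the handler shown in the help
output above, here is a minimal sketch for Python 3. The class name
`LoggingRedirectHandler`, its `hops` attribute and the URL in the
comments are made up for this example; only `HTTPRedirectHandler`,
`redirect_request` and `build_opener` come from urllib.request.

```python
import urllib.request


class LoggingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Record every redirect hop, then defer to the default behaviour."""

    def __init__(self):
        # Hypothetical attribute: list of (status code, new URL) pairs.
        self.hops = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.hops.append((code, newurl))
        # Let the stock implementation decide whether to follow.
        return super().redirect_request(req, fp, code, msg, headers, newurl)


handler = LoggingRedirectHandler()
# build_opener uses our subclass in place of the default redirect handler.
opener = urllib.request.build_opener(handler)

# Usage (needs network access; the URL is only a placeholder):
# response = opener.open("http://example.com/old-page")
# print(response.geturl(), handler.hops)
```

Returning None or raising HTTPError from redirect_request instead of
calling the parent method is how a subclass would decline to follow a
redirect, as the docstring in the help output describes.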

best regards,
Robert S.