[Tutor] Strategy to read a redirecting html page
Robert Sjoblom
robert.sjoblom at gmail.com
Thu Jun 2 22:06:05 CEST 2011
> When you hit the page and you get an HTTP redirect code back (say,
> 302), you will need to make another call to the URL specified in the
> "Location" parameter in the response headers. Then you retrieve that
> new page and you can check you got an acceptable HTTP response code
> (such as 200) and read the page's body (or whatever you want to do
> with it). Otherwise, keep looping until you get an expected HTTP
> response code.
>
> Note: you may get stuck in an infinite loop if two URLs redirect to each other.
>
> You might want to take a look at the higher level httplib module:
> http://docs.python.org/library/httplib.html
>
> Although I don't think it can automatically follow redirects for you.
> You'll have to implement the loop yourself.
>
> If you can rely on 3rd party packages (not part of the standard Python
> library), take a look at httplib2:
> https://httplib2.googlecode.com/hg/doc/html/libhttplib2.html
>
> This one can follow redirects.
>
> HTH,
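
For reference, the manual loop described in the quote could look roughly
like the sketch below. It is written for Python 3, where httplib is named
http.client. The names probe_once and resolve, and the max_hops limit,
are made up for illustration and are not part of any library. The seen
set is one simple way to guard against two URLs redirecting to each
other.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# Status codes treated as redirects in this sketch
REDIRECT_CODES = {301, 302, 303, 307}


def probe_once(url):
    """Make one GET request without following redirects.

    Returns (status, value of the Location header or None).
    Hypothetical helper, not a library function.
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def resolve(url, probe=probe_once, max_hops=10):
    """Follow redirects by hand; return (final URL, final status)."""
    seen = set()
    for _ in range(max_hops):
        if url in seen:
            raise RuntimeError("redirect loop at " + url)
        seen.add(url)
        status, location = probe(url)
        if status not in REDIRECT_CODES or location is None:
            return url, status
        # Location may be relative, so resolve it against the current URL
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")
```

The probe argument is only there so that the loop can be tried out
against a fake lookup table without touching the network.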
Sorry for bringing up an old topic like this, but writing longer
messages on a phone is just not something that I want to do.
Python already ships with urllib2 (urllib.request in Python 3), which
follows redirects automatically, so I don't see why you'd need a
3rd-party module to deal with it. When it gets a 301 status code from
the server, urllib2 searches through its handlers and calls the
http_error_301 method, which reads the Location: header and follows
that address. The same method also handles 302, 303 and 307. The
behaviour is defined in HTTPRedirectHandler, which can be overridden if
necessary:
>>> help(urllib.request.HTTPRedirectHandler)
Help on class HTTPRedirectHandler in module urllib.request:
class HTTPRedirectHandler(BaseHandler)
| Method resolution order:
| HTTPRedirectHandler
| BaseHandler
| builtins.object
|
| Methods defined here:
|
| http_error_301 = http_error_302(self, req, fp, code, msg, headers)
|
| http_error_302(self, req, fp, code, msg, headers)
| # Implementation note: To avoid the server sending us into an
| # infinite loop, the request object needs to track what URLs we
| # have already seen. Do this by adding a handler-specific
| # attribute to the Request object.
|
| http_error_303 = http_error_302(self, req, fp, code, msg, headers)
|
| http_error_307 = http_error_302(self, req, fp, code, msg, headers)
|
| redirect_request(self, req, fp, code, msg, headers, newurl)
| Return a Request or None in response to a redirect.
|
| This is called by the http_error_30x methods when a
| redirection response is received. If a redirection should
| take place, return a new Request to allow http_error_30x to
| perform the redirect. Otherwise, raise HTTPError if no-one
| else should try to handle this url. Return None if you can't
| but another Handler might.
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| inf_msg = 'The HTTP server returned a redirect error that w...n infini...
|
| max_redirections = 10
|
| max_repeats = 4
|
| ----------------------------------------------------------------------
| Methods inherited from BaseHandler:
|
| __lt__(self, other)
|
| add_parent(self, parent)
|
| close(self)
|
| ----------------------------------------------------------------------
| Data descriptors inherited from BaseHandler:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Data and other attributes inherited from BaseHandler:
|
| handler_order = 500
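
As a minimal sketch of overriding it (Python 3), the subclass below just
records each hop and then defers to the stock redirect_request. The
names LoggingRedirectHandler, hops and fetch are made up for
illustration. build_opener() should use an instance of the subclass in
place of the default HTTPRedirectHandler.

```python
import urllib.request


class LoggingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Record every redirect before letting the default logic follow it."""

    def __init__(self):
        self.hops = []  # (status code, new URL) pairs, in the order seen

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.hops.append((code, newurl))
        # Defer to the stock behaviour, which builds the follow-up Request
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def fetch(url):
    """Open url, following redirects; return (final URL, body, hops)."""
    handler = LoggingRedirectHandler()
    opener = urllib.request.build_opener(handler)
    with opener.open(url) as response:
        return response.geturl(), response.read(), handler.hops
```

Returning None from redirect_request, or raising HTTPError there, is how
the docstring above says you would refuse a particular redirect.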
best regards,
Robert S.