[Twisted-Python] Problem fetching page with getPage
I've run into a problem fetching an HTTP page with t.w.client.getPage. It's not simple to make standalone code showing what's going wrong, but the following summarizes where I am and why I find this puzzling.

After some setup, I have a URL path and some headers I want to send. A summary:

    host = 'ec2.amazon.com'
    port = 443
    path = '/?some=params&are=here&etc=etc'
    method = 'GET'
    data = ''
    headers = { 'some' : 'headers', 'Content-Length' : '0' }
    url = 'https://%s:%d%s' % (host, port, path)

The actual details don't matter right now, I don't think. When I call

    d = getPage(url, headers=headers)

d's errback fires with a twisted.web.error.Error with a 403 status. So you'd think I had something wrong in my headers, or was trying to access a forbidden resource, etc.

But... when I drop this code in instead of the call to getPage:

    import httplib
    cx = httplib.HTTPSConnection(host, port)
    cx.request(method, path, data, headers)
    response = cx.getresponse()
    print 'response status:', response.status
    body = response.read()
    print 'body:', body

I get a 200 status, and the body is exactly as expected.

BTW, the path above does start with a slash. I've tried using HTTPClientFactory and reactor.connectSSL directly. I've tried with and without the '' postdata and Content-Length header. I've tried with Twisted 8.2.0 and 9.0.0. And of course I've checked many times that the URL and its query params requested by httplib and getPage are identical (apart from the time-sensitive signature).

The reason it's not easy to provide a simple example is that the URL and headers have signed components, based in part on a timestamp and in part on Amazon secret keys, etc. It's not easy to separate all that, and if I did I'd be posting at least 100 lines of code that would only run if you supplied your own Amazon AWS details. In any case, it looks like the problem is not in the setup of the request.
Can anyone offer a reason why httplib might be able to fetch the page whereas getPage receives an error? I'm stumped. Terry
On Jan 2, 2010, at 9:34 AM, Terry Jones wrote:
In any case, it looks like the problem is not in the setup of the request. Can anyone offer a reason why httplib might be able to fetch the page whereas getPage receives an error? I'm stumped.
I've had to debug things like this recently and I have two suggestions:

1> Recreate the headers and make it work with curl. Curl won't add anything to your headers, so you can be sure you're getting the result you want with a completely stripped-down case.

2> Get Charles http://www.charlesproxy.com/ if you're on OS X. It rocks. Otherwise, get one of the Windows tools (sorry, no recos from me on that), and watch exactly what goes by.

I had a situation where Python's httplib stuff was adding an Accept-Encoding header that I didn't put there, and it exposed a bug in the API I was using. When I ran it with curl, it worked fine since no additional headers were added. Charles helped me see what was going on (unfortunately, long after they had fixed that particular bug in the API).

S aka/Steve Steiner aka/ssteinerX
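One way to see which headers a client library adds on its own, without a proxy, is to point it at a throwaway local listener and look at the raw bytes it writes. The following is only a minimal sketch under several assumptions: it is Python 3 (where httplib is named http.client), it uses plain HTTP against 127.0.0.1 rather than HTTPS against the real host, and the helper names capture_request, send_with_http_client, header_names and added_headers are invented for illustration. The path and headers are the placeholders from the original message.

```python
import http.client
import socket
import threading


def capture_request(send_request):
    """Call send_request(host, port) against a one-shot local listener
    and return the raw request bytes (request line plus headers)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    listener.settimeout(5)
    host, port = listener.getsockname()
    captured = []

    def serve():
        try:
            conn, _ = listener.accept()
        except socket.timeout:
            return  # the client never connected
        conn.settimeout(5)
        data = b""
        # Read until the blank line that ends the header block.
        while b"\r\n\r\n" not in data:
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
        captured.append(data)
        # Minimal canned response so the client can finish cleanly.
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n"
                     b"Connection: close\r\n\r\n")
        conn.close()

    thread = threading.Thread(target=serve)
    thread.start()
    try:
        send_request(host, port)
    finally:
        thread.join()
        listener.close()
    return captured[0] if captured else b""


def send_with_http_client(host, port):
    """Send the placeholder request from the thread using http.client."""
    headers = {"some": "headers", "Content-Length": "0"}
    cx = http.client.HTTPConnection(host, port, timeout=5)
    cx.request("GET", "/?some=params&are=here&etc=etc", "", headers)
    cx.getresponse().read()
    cx.close()


def header_names(raw_request):
    """Return the lower-cased header names, in the order they were sent."""
    head = raw_request.split(b"\r\n\r\n", 1)[0]
    lines = head.split(b"\r\n")[1:]  # skip the request line
    return [line.split(b":", 1)[0].decode("ascii").lower() for line in lines]


def added_headers(raw_request, supplied):
    """Return the header names on the wire that the caller did not supply."""
    supplied_lower = {name.lower() for name in supplied}
    return sorted(set(header_names(raw_request)) - supplied_lower)


if __name__ == "__main__":
    raw = capture_request(send_with_http_client)
    print(raw.decode("latin-1"))
    print("added by the library:", added_headers(raw, ["some", "Content-Length"]))
```

Any other client can be captured the same way by passing a different send_request callable, which makes it possible to compare what each library puts on the wire for the same logical request.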
On Saturday, 02.01.2010, at 10:03 -0500, ssteinerX@gmail.com wrote:
On Jan 2, 2010, at 9:34 AM, Terry Jones wrote:
In any case, it looks like the problem is not in the setup of the request. Can anyone offer a reason why httplib might be able to fetch the page whereas getPage receives an error? I'm stumped.
I've had to debug things like this recently and I have two suggestions:
1> Recreate the headers and make it work with curl. Curl won't add anything to your headers and such and you'll be sure that you're getting the result you want with completely stripped down case.
2> Get Charles http://www.charlesproxy.com/ if you're on OS X. It rocks. Otherwise, get one of the Windows tools (sorry, no recos from me on that), and watch exactly what goes by.
Actually, CharlesProxy is a Java tool, AFAIK. And personally I'm really not that sure that it rocks, but personal opinions do vary :) As a free alternative, WebScarab can handle the man-in-the-middle interception too. Consider also using FoxyProxy (a Firefox add-on) to direct only the URLs you are interested in to the logging proxy. Andreas
On Jan 2, 2010, at 9:34 AM, Terry Jones wrote:
In any case, it looks like the problem is not in the setup of the request. Can anyone offer a reason why httplib might be able to fetch the page whereas getPage receives an error? I'm stumped.
Well, I know this isn't terribly helpful, but "a bug in getPage" is really the only thing that comes to mind. Or, some legal-but-unusual behavior in getPage which triggers a bug on the EC2 side of things. The only thing I can suggest is to start wireshark, do a byte-for-byte comparison of the requests that getPage and httplib emit, and see if you can find any of the differences which might be significant. I would look carefully at any place in the request or response where data is being quoted or unquoted. Based on the other stuff you've said, nothing jumps out at me.
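Once the two requests have been captured (from wireshark or otherwise), the comparison itself can be scripted. This is a rough Python 3 sketch, not anything from Twisted or httplib: the function names diff_requests, request_target and differs_only_in_quoting are made up here, and the two sample requests are hypothetical placeholders rather than real output from getPage or httplib. It shows a line-by-line diff plus a check for request targets that differ only in percent-encoding.

```python
import difflib
from urllib.parse import unquote


def diff_requests(label_a, raw_a, label_b, raw_b):
    """Return a unified diff of two raw HTTP requests, line by line."""
    # latin-1 maps every byte to one character, so nothing is lost or altered.
    lines_a = raw_a.decode("latin-1").split("\r\n")
    lines_b = raw_b.decode("latin-1").split("\r\n")
    return list(difflib.unified_diff(lines_a, lines_b,
                                     fromfile=label_a, tofile=label_b,
                                     lineterm=""))


def request_target(raw_request):
    """Extract the request target (path plus query) from the request line."""
    request_line = raw_request.split(b"\r\n", 1)[0].decode("latin-1")
    return request_line.split(" ")[1]


def differs_only_in_quoting(target_a, target_b):
    """True if the targets differ as bytes but match once percent-decoded."""
    return target_a != target_b and unquote(target_a) == unquote(target_b)


if __name__ == "__main__":
    # Hypothetical captures, invented purely to exercise the functions.
    raw_a = (b"GET /?Signature=a%2Bb%3D HTTP/1.1\r\n"
             b"Host: example.invalid\r\n\r\n")
    raw_b = (b"GET /?Signature=a+b= HTTP/1.0\r\n"
             b"Host: example.invalid\r\n"
             b"User-Agent: hypothetical-client\r\n\r\n")

    for line in diff_requests("client A", raw_a, "client B", raw_b):
        print(line)

    if differs_only_in_quoting(request_target(raw_a), request_target(raw_b)):
        print("request targets differ only in percent-encoding")
```

A quoting-only difference matters for signed requests in particular, since a server that recomputes the signature over the bytes it received may reject a target that decodes to the same value but was encoded differently.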
participants (4):
- Andreas Kostyrka
- Glyph Lefkowitz
- ssteinerX@gmail.com
- Terry Jones