I'm trying to fetch some web pages using the following code, which I did not write. It seems to work (apart from the problem described here, so far), but I have no idea whether this is a reasonable way to fetch simple web pages at all:
    def getPage(url, contextFactory=None, *args, **kwargs):
        """
        Download a web page as a string.

        Download a page. Return a deferred, which will callback with a
        page (as a string) or errback with a description of the error.

        See HTTPClientFactory to see what extra args can be passed.
        """
        scheme, host, port, path = parse_url(url)
        factory = HTTPClientFactory(url, *args, **kwargs)
        if scheme == 'https':
            # Import lazily so plain HTTP works without SSL support.
            from twisted.internet import ssl
            if contextFactory is None:
                contextFactory = ssl.ClientContextFactory()
            reactor.connectSSL(host, port, factory, contextFactory)
        else:
            reactor.connectTCP(host, port, factory)
        # The factory's deferred is what callers attach their callbacks to.
        return factory.deferred
The code then adds a number of callbacks to the returned deferred to do various things to the data, and everything works well until it hits the URL shown below. For that URL the deferred never calls any of its callbacks and never seems to finish.
I haven't found any way to dump the actual headers from within Twisted as this occurs, so the header values shown below are from Firefox requesting the same URL. I will put tcpdump on the connection if I need to in order to figure this out, but I suspect this is something simple (or something wrong with the approach used in the code above).
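As a stopgap short of tcpdump, the response headers could also be captured outside Twisted. The following is only a minimal sketch, assuming Python 3 and its standard `http.client` module; the function name `dump_response_headers` is hypothetical and nothing here is part of Twisted. The explicit timeout means a server that stalls raises an error instead of hanging forever.

```python
import http.client


def dump_response_headers(host, path, port=80, timeout=10.0):
    """Issue a plain GET and return (status, list of (name, value) headers).

    Hypothetical helper for debugging only; it bypasses Twisted entirely.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        # Ask the server to close the connection after this one response.
        conn.request("GET", path, headers={"Connection": "close"})
        response = conn.getresponse()
        return response.status, response.getheaders()
    finally:
        conn.close()
```

It would be called with the host and the path plus query string from the request shown below, for example `dump_response_headers("www.integrateddevcorp.com", "/index.php?...")` with the full query filled in.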
Can anyone tell me what it is about this particular transaction that prevents the deferred from firing its callbacks? I presume it is because the request never finishes receiving the data it is waiting for. This particular URL returns a .vcf file.
Also, what is the proper intervention? I'd rather not download the .vcf at all, since it is completely useless for my purpose, but I'm not familiar enough with twisted.web to know where to intervene.
    GET /index.php?option=com_contact&task=vcard&contact_id=1&format=raw&tmpl=component HTTP/1.1
    Host: www.integrateddevcorp.com
    User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language: en-us,en;q=0.5
    Accept-Encoding: gzip,deflate
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    Keep-Alive: 300
    Connection: keep-alive

    HTTP/1.x 200 OK
    Date: Thu, 08 Oct 2009 21:14:37 GMT
    Server: Apache
    X-Powered-By: PHP/5.2.8
    Set-Cookie: ff70eb7218d444fa639af7ae7e66e82f=488606e54b7fdd9affb0b0725a2a6607; path=/
    P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
    Content-Disposition: attachment; filename=Integrated_Development_Corporation.vcf
    Content-Length: 1020
    Connection: close
    Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
    Pragma: no-cache
    Expires: Mon, 1 Jan 2001 00:00:00 GMT
    Last-Modified: Thu, 08 Oct 2009 21:14:37 GMT
    Content-Type: text/html; charset=utf-8
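Note that this response declares `Content-Type: text/html`, so the only header that identifies it as a vCard is `Content-Disposition`. Whatever the right hook in twisted.web turns out to be, the skip decision itself could look something like the sketch below. It uses only the Python 3 standard library; the names `attachment_filename` and `should_skip` are hypothetical, and where to call this inside Twisted is exactly the open question.

```python
from email.message import Message


def attachment_filename(content_disposition):
    """Return the filename of an 'attachment' disposition, or None."""
    if not content_disposition:
        return None
    # Reuse the stdlib MIME header parser rather than splitting by hand.
    msg = Message()
    msg["Content-Disposition"] = content_disposition
    if msg.get_content_disposition() != "attachment":
        return None
    return msg.get_filename()


def should_skip(headers, unwanted_suffixes=(".vcf",)):
    """Decide from the response headers alone whether the body is unwanted."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    filename = attachment_filename(lowered.get("content-disposition"))
    if filename is None:
        return False
    return filename.lower().endswith(tuple(unwanted_suffixes))
```

With the response headers above, `should_skip` would flag the download because the attachment filename ends in `.vcf`, while an ordinary HTML page with no `Content-Disposition` header would pass through.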