adding etag and modified arguments to twisted feedparser
hi,

I am having trouble modifying the twisted-based rss aggregator from the python cookbook so that feedparser can make use of the update-related arguments 'etag' and 'modified' to save bandwidth (see http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/277099).

I realise that the problem is deferred related, but I can't seem to resolve it, even after reading the deferred documentation.

Anyway, the deferred functions that I think are relevant are:

1)

    def getPage(self, data, args):
        # args is the rss feed link
        return client.getPage(args, timeout=TIMEOUT)

2)

    def parseFeed(self, feed):
        parsed = feedparser.parse(cStringIO.StringIO(feed))

The problem is that getPage() requests the entire rss feed and then passes the stream through to feedparser.parse(). Normally, however, feedparser.parse() takes the further arguments 'etag' and 'modified' so that only new feed information is returned, thereby saving bandwidth.

I tried modifying getPage() to return feedparser.parse(args), which removes the need for parseFeed(), but it runs substantially slower than the original method, I presume because it runs synchronously.

Any assistance in restoring the impressive parallel downloading performance, but with the datetime arguments included, would be greatly appreciated.

many thanks,
Selwyn
On Sep 28, 2004, at 5:53 PM, Selwyn McCracken wrote:
I am having trouble modifying the twisted-based rss aggregator from the python cookbook so that feedparser can make use of the update-related arguments 'etag' and 'modified' to save bandwidth (see http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/277099).
I realise that the problem is deferred related, but I can't seem to resolve the problem, even after reading the deferred documentation.
Not particularly deferred related, more t.w.client related. I assume what's happening is that feedparser.parse() can take either a URL or a file-like object. If it is given a URL, it uses its internal HTTP fetching code, which is synchronous. Twisted's HTTP client is asynchronous, so you want to use that.

So what you need to know is how to send the etag/modified information to Twisted's HTTP client.

You want something like:

    def getPage(self, data, args):
        # args is the rss feed link
        return client.getPage(args, timeout=TIMEOUT,
                              headers={'If-None-Match': '"xyzzy"',
                                       'If-Modified-Since': 'Sun, 09 Sep 2001 01:46:40 GMT'})

However, client.getPage doesn't leave you with any way to get at the response headers (so that you can save the etag and last-modified values for the next request), so you'll need to use HTTPClientFactory directly (cribbing from the code in client.getPage). Basically, after the deferred fires, factory.response_headers will have the data you want, so you just need to keep a reference to the factory around.

James
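The header bookkeeping described above can be sketched without touching Twisted at all. This is only an illustration under stated assumptions: `build_conditional_headers` and `extract_validators` are hypothetical helper names, not part of Twisted or feedparser, and the shape of the response header mapping (header name to either a string or a list of strings) is assumed rather than taken from the library.

```python
def build_conditional_headers(etag=None, modified=None):
    """Return the request headers for a conditional GET.

    Hypothetical helper: only the validators we actually have are sent.
    """
    headers = {}
    if etag:
        headers['If-None-Match'] = etag
    if modified:
        headers['If-Modified-Since'] = modified
    return headers


def extract_validators(response_headers):
    """Return (etag, last_modified) from a response header mapping.

    Assumption: values may be plain strings or lists of strings, and
    header names may use any capitalisation.
    """
    def first(name):
        for key, value in response_headers.items():
            if key.lower() == name:
                if isinstance(value, (list, tuple)):
                    return value[0] if value else None
                return value
        return None

    return first('etag'), first('last-modified')
```

The dict returned by `build_conditional_headers` is what would be passed as `headers=` in the getPage() call above, and `extract_validators` would be applied to the saved response headers once the deferred has fired, so the values can be stored for the next poll of the same feed.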
thanks James, I'll look into that. Hopefully it won't result in too many more questions ;-)
_______________________________________________ Twisted-web mailing list Twisted-web@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-web
On Wed, 29 Sep 2004 09:53:31 +1200, Selwyn McCracken <selwyn.mccracken@stonebow.otago.ac.nz> wrote:
hi,
I am having trouble modifying the twisted-based rss aggregator from the python cookbook so that feedparser can make use of the update-related arguments 'etag' and 'modified' to save bandwidth (see http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/277099).
Glad to see someone found that stuff useful :). I'm the author of that recipe.

You will also be glad to know that I already solved your problem months ago, and you can find the solution here: http://www.twistedmatrix.com/users/roundup.twistd/twisted/issue612

Unfortunately, when I asked for it to be added to twisted.web, Itamar rejected it because twisted.web was already on its way to deprecation.

It should work without any problems, and you should use client.getPageCached() instead of client.getPage().

--
Valentino Volonghi aka Dialtone
Linux User #310274, Proud Gentoo User
Blog: http://vvolonghi.blogspot.com
Home Page: http://xoomer.virgilio.it/dialtone/
many thanks for both your answer and your recipe (it is very useful). I will try to get things working with getPageCached().
Hi Valentino,

sorry for bugging you again. I have had a look through httpcache.py, but as a total newcomer to twisted I am slightly overwhelmed by the concepts of factories and protocols at this stage.

I tried modifying getPage() from your recipe to use httpcache, so that I could see what was happening, like so:

    def getPage(self, data, args):
        return httpcache.getPageCached(args, timeout=TIMEOUT)

However, this triggers the following error: "global name '_parse' is not defined", and I'm not sure how to proceed.

In any case, what I would ideally like is something like:

    def getPage(self, data, args):
        return httpcache.getPageCached(args, timeout=TIMEOUT,
                                       etag=_ETAG, modified=_MODIFIED)

This would simply return either the page in full or the 304 response, and then I can handle the caching and timestamping outside, in workOnPage().

Any help would once again be gratefully received.

thanks,
Selwyn
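The caching and timestamping that would live on the workOnPage() side could look roughly like the sketch below. It is a guess at one workable shape, not the httpcache.py implementation: `FeedCache` and its methods are hypothetical names, and how the status code and validators reach this code (callback, errback, or factory attributes) is deliberately left open.

```python
class FeedCache:
    """Remember the last body and validators seen for each feed URL.

    Hypothetical helper for illustration; not part of Twisted,
    feedparser, or httpcache.py.
    """

    def __init__(self):
        # url -> {'etag': ..., 'modified': ..., 'body': ...}
        self._entries = {}

    def validators(self, url):
        """Return (etag, modified) to send with the next request."""
        entry = self._entries.get(url, {})
        return entry.get('etag'), entry.get('modified')

    def update(self, url, status, body=None, etag=None, modified=None):
        """Record a response and return (body, changed).

        A 200 stores the new body and validators; a 304 hands back the
        previously stored body and reports that nothing changed.
        """
        status = int(status)
        if status == 304:
            entry = self._entries.get(url)
            if entry is None:
                raise KeyError('304 received with no cached copy of %r' % url)
            return entry['body'], False
        if status == 200:
            self._entries[url] = {'etag': etag, 'modified': modified,
                                  'body': body}
            return body, True
        raise ValueError('unexpected HTTP status %d for %r' % (status, url))
```

With something like this, the feed only needs to be re-parsed when `changed` is true, and `validators(url)` supplies the values for the `etag=` and `modified=` arguments wished for above.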
participants (3)
- James Y Knight
- Selwyn McCracken
- Valentino Volonghi