Mailman 3 adding etag and modified arguments to twisted feedparser - Twisted-web

adding etag and modified arguments to twisted feedparser

Selwyn McCracken

Sept. 28, 2004

5:53 p.m.

hi,

I am having trouble modifying the twisted-based rss aggregator from the python cookbook so that feedparser can make use of the update-related arguments 'etag' and 'modified' to save bandwidth. (see http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/277099)

I realise that the problem is deferred related, but I can't seem to resolve it, even after reading the deferred documentation.

Anyway, the series of deferred functions that I think are relevant are:

1)
    def getPage(self, data, args):  # args is the rss feed link
        return client.getPage(args, timeout=TIMEOUT)

2)
    def parseFeed(self, feed):
        parsed = feedparser.parse(cStringIO.StringIO(feed))

The problem is that getPage() requests the entire rss feed, and then passes the stream through to feedparser.parse(). Normally, however, feedparser.parse() takes the further arguments 'etag' and 'modified' so that only new feed information is returned, thereby saving bandwidth.

I tried modifying getPage() to return feedparser.parse(args), removing the need for parseFeed(), but it runs substantially slower than the original method, I presume because it runs synchronously.

Any assistance in helping to restore the impressive parallel downloading performance, but with the datetime arguments included, would be greatly appreciated.

many thanks,
Selwyn

James Y Knight

September 2004

8:10 p.m.

On Sep 28, 2004, at 5:53 PM, Selwyn McCracken wrote:

> I am having trouble modifying the twisted-based rss aggregator from the python cookbook so that feedparser can make use of the update related arguments of 'etag' and 'modified' to save bandwidth. (see http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/277099)
>
> I realise that the problem is deferred related, but I can't seem to resolve the problem, even after reading the deferred documentation.

Not particularly deferred related, more t.w.client related. I assume what's happening is that feedparser.parse() can take either a URL or a file-like object. If it takes a URL, it uses its internal HTTP getting method, which is synchronous. Twisted's HTTP client is asynchronous, so you want to use that.

So what you need to know is how to send the etag/modified information to Twisted's HTTP client.

You want something like:

    def getPage(self, data, args):  # args is the rss feed link
        return client.getPage(args, timeout=TIMEOUT,
                              headers={'If-None-Match': '"xyzzy"',
                                       'If-Modified-Since': 'Sun, 09 Sep 2001 01:46:40 GMT'})

However, client.getPage doesn't leave you with any way to get at the response headers (so you can save the etag and last modified responses for the next request), so you'll need to use HTTPClientFactory directly (cribbing from the code in client.getPage). Basically, after the deferred fires, factory.response_headers will have the data you want, so you just need to keep a reference to factory around.

James
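The bookkeeping James describes can be sketched without any Twisted code at all. The two helpers below are hypothetical (their names are not from the thread or from Twisted): one turns previously saved validators into conditional request headers, the other pulls new validators out of a response. The sketch assumes the response headers arrive as a mapping from lowercase header names to lists of values; check what your client actually hands you before relying on that shape.

```python
from typing import Dict, List, Mapping, Optional, Tuple


def conditional_headers(etag: Optional[str],
                        modified: Optional[str]) -> Dict[str, str]:
    """Build the request headers for a conditional GET.

    Validators that were never saved (None or empty) are simply left out,
    so the first request for a feed is an ordinary unconditional GET.
    """
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if modified:
        headers["If-Modified-Since"] = modified
    return headers


def extract_validators(
    response_headers: Mapping[str, List[str]],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (etag, last_modified) from a response, or None where absent.

    ASSUMPTION: keys are lowercase and each value is a list of strings.
    """
    def first(name: str) -> Optional[str]:
        values = response_headers.get(name) or []
        return values[0] if values else None

    return first("etag"), first("last-modified")


if __name__ == "__main__":
    # Validators saved from a previous response feed the next request.
    saved = extract_validators({
        "etag": ['"xyzzy"'],
        "last-modified": ["Sun, 09 Sep 2001 01:46:40 GMT"],
    })
    print(conditional_headers(*saved))
```

The result of conditional_headers() is what would be passed as the headers argument in James's getPage() example; extract_validators() is what would run once the deferred has fired and the response headers are available.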

Selwyn McCracken

8:24 p.m.

thanks James, I'll look into that. Hopefully it won't result in too many more questions ;-)

Valentino Volonghi

8:25 p.m.

On Wed, 29 Sep 2004 09:53:31 +1200, Selwyn McCracken <selwyn.mccracken@stonebow.otago.ac.nz> wrote:

> hi,
>
> I am having trouble modifying the twisted-based rss aggregator from the python cookbook so that feedparser can make use of the update related arguments of 'etag' and 'modified' to save bandwidth. (see http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/277099)

Glad to see someone found that stuff useful :). I'm the author of that recipe.

You will also be glad to know that I already solved your problem months ago, and you can find the solution here: http://www.twistedmatrix.com/users/roundup.twistd/twisted/issue612

Unfortunately, when I asked for it to be added to twisted.web, Itamar rejected it because twisted.web was already on its way to deprecation.

It should work without any problems, and you should use client.getPageCached() instead of client.getPage().

--
Valentino Volonghi aka Dialtone
Linux User #310274, Proud Gentoo User
Blog: http://vvolonghi.blogspot.com
Home Page: http://xoomer.virgilio.it/dialtone/

Selwyn McCracken

8:37 p.m.

many thanks for both your answer and your recipe (it is very useful). I will try and get things working with getPageCached()

Selwyn McCracken

5:45 a.m.

Hi Valentino,

sorry for bugging you again. I have had a look through httpcache.py, but as a total newcomer to twisted I am slightly overwhelmed by the concepts of factories and protocols at this stage.

I tried modifying getPage() from your recipe to use httpcache, so that I could see what was happening, like so:

    def getPage(self, data, args):
        return httpcache.getPageCached(args, timeout=TIMEOUT)

However, this triggers the following error: "global name '_parse' is not defined", and I'm not sure how to proceed.

In any case, what I would ideally like is something like:

    def getPage(self, data, args):
        return httpcache.getPageCached(args, timeout=TIMEOUT,
                                       etag=_ETAG, modified=_MODIFIED)

This would simply return the page in full or the 304 error, and then I can handle the caching and timestamping outside in workOnPage().

Any help once again would be gratefully received.

thanks,
Selwyn
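The "page in full or 304, with caching handled outside" arrangement Selwyn asks for might look roughly like the sketch below. FeedCache and its methods are hypothetical names made up for illustration, not part of httpcache.py or Twisted. How the status code reaches this code (callback or errback) depends on the HTTP client and is deliberately left out.

```python
from typing import Dict, Optional, Tuple

# url -> (etag, modified, body); validators may be None if the server sent none
CacheEntry = Tuple[Optional[str], Optional[str], bytes]


class FeedCache:
    """Remember the last validators and body seen for each feed URL."""

    def __init__(self) -> None:
        self._entries: Dict[str, CacheEntry] = {}

    def validators(self, url: str) -> Tuple[Optional[str], Optional[str]]:
        """Return the saved (etag, modified) for url, or (None, None)."""
        entry = self._entries.get(url)
        return (entry[0], entry[1]) if entry else (None, None)

    def update(self, url: str, status: int, body: bytes,
               etag: Optional[str] = None,
               modified: Optional[str] = None) -> Optional[bytes]:
        """Record a response; return the body to parse, or None if unchanged.

        200 stores the new body and validators and returns the body.
        304 means "not modified": keep what we have and return None.
        """
        if status == 304:
            return None
        if status == 200:
            self._entries[url] = (etag, modified, body)
            return body
        raise ValueError("unexpected HTTP status: %d" % status)


if __name__ == "__main__":
    cache = FeedCache()
    url = "http://example.com/feed.rss"  # placeholder URL
    # First fetch: full page, validators saved for next time.
    cache.update(url, 200, b"<rss/>", etag='"v1"')
    # Next fetch: the server answers 304, so there is nothing new to parse.
    print(cache.update(url, 304, b""))
```

With something like this, a workOnPage()-style callback only calls the feed parser when update() returns a body, and validators() supplies the etag and modified values for the next request.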
