[Twisted-Python] etag and last-modified

I have a script that downloads multiple rss/atom feeds via Feedparser. The script uses twisted.internet but the developer tells me there is no way to use etag and last-modified with twisted. Instead I'll let Feedparser do the download and use twisted for threads. What is the maximum pool size I can use? Thanks, Jacob

On Fri, 05 Nov 2004 10:19:40 +0100, Jacob Friis <lists@debpro.webcom.dk> wrote:
I have a script that downloads multiple rss/atom feeds via Feedparser. The script uses twisted.internet but the developer tells me there is no way to use etag and last-modified with twisted.
This is not true. The twisted-web mailing list can provide you with details.
Instead I'll let Feedparser do the download and use twisted for threads. What is the maximum pool size I can use?
Using threads to do this, there is no point using Twisted at all. Since Twisted is perfectly capable of downloading select web pages based on their headers, there's no reason to use threads. Jp
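For reference, the ETag and Last-Modified handling under discussion is plain HTTP and does not depend on any particular client library: the client sends back the validators it saved from the previous response as If-None-Match and If-Modified-Since, and the server may answer 304 Not Modified. A minimal sketch of building those request headers follows; the function name conditional_headers is made up for illustration and is not part of Twisted or Feedparser.

```python
def conditional_headers(etag=None, last_modified=None):
    """Return request headers for a conditional GET.

    Both arguments are the raw header values saved from the previous
    response for the same URL, or None if the server sent none.
    """
    headers = {}
    if etag:
        # The ETag value is sent back verbatim, quotes included.
        headers["If-None-Match"] = etag
    if last_modified:
        # The Last-Modified value is likewise echoed back unchanged.
        headers["If-Modified-Since"] = last_modified
    return headers


print(conditional_headers('"abc123"', "Fri, 05 Nov 2004 09:19:40 GMT"))
```

A server that honours either header replies with status 304 and an empty body when the feed is unchanged, so nothing has to be downloaded or parsed again.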

exarkun@divmod.com wrote:
Instead I'll let Feedparser do the download and use twisted for threads. What is the maximum pool size I can use?
Using threads to do this, there is no point using Twisted at all. Since Twisted is perfectly capable of downloading select web pages based on their headers, there's no reason to use threads.
But I need to download approx 150000 files several times every day from approx 7000 servers. That's why I thought threading would be the solution. Can I use Twisted for this? Thanks, Jacob

On Fri, 05 Nov 2004 15:53:34 +0100, Jacob Friis <lists@debpro.webcom.dk> wrote:
exarkun@divmod.com wrote:
Instead I'll let Feedparser do the download and use twisted for threads. What is the maximum pool size I can use?
Using threads to do this, there is no point using Twisted at all. Since Twisted is perfectly capable of downloading select web pages based on their headers, there's no reason to use threads.
But I need to download approx 150000 files several times every day from approx 7000 servers. That's why I thought threading would be the solution. Can I use Twisted for this?
Threads are an _inferior_ mechanism for network concurrency. Twisted can download multiple files simultaneously without using threads, and generally speaking, more efficiently than using threads. Jp
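As a rough illustration of downloading many files at once without threads, here is a sketch built on Agent, readBody and DeferredSemaphore from current Twisted releases. These APIs postdate this thread, and the names fetch_all and fetch_one as well as the default limit of 15 are illustrative choices, not something posted here. HTTPS URLs additionally need Twisted's TLS dependencies installed.

```python
def fetch_all(urls, limit=15):
    """Fetch every URL concurrently in a single thread.

    Returns a Deferred that fires with a list of (url, status, body)
    tuples. `limit` caps the number of requests in flight at once.
    A failed request is reported as (url, None, b"").
    """
    # Imported here so that loading this sketch does not require Twisted.
    from twisted.internet import defer, reactor
    from twisted.web.client import Agent, readBody

    agent = Agent(reactor)
    gate = defer.DeferredSemaphore(limit)

    def fetch_one(url):
        d = agent.request(b"GET", url.encode("ascii"))

        def got_response(response):
            # readBody returns a Deferred that fires with the full body.
            body = readBody(response)
            body.addCallback(lambda data: (url, response.code, data))
            return body

        d.addCallback(got_response)
        # One unreachable server should not fail the whole batch.
        d.addErrback(lambda failure: (url, None, b""))
        return d

    # gate.run() starts fetch_one only once one of the `limit` slots is free.
    return defer.gatherResults([gate.run(fetch_one, url) for url in urls])
```

With a list of feed URLs in urls, something like task.react(lambda reactor: fetch_all(urls).addCallback(print)), with task imported from twisted.internet, would run the reactor until every download has finished and print the results.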

On Sat, 06 Nov 2004 16:30:41 +0100, Jacob Friis <lists@debpro.webcom.dk> wrote:
Twisted can download multiple files simultaneously without using threads, and generally speaking, more efficiently than using threads.
In which part of Twisted should I look for this feature, and do you know of example scripts?
I'm a Python beginner :)
I already implemented ETag and Not Modified handling for the aggregator I wrote some time ago (which is currently in the Python Cookbook). Since I also wanted to avoid downloading the same feed many times, I implemented a getPageCached() for twisted.web.client and submitted it to Twisted. Unfortunately it was not accepted, both because twisted.web is deprecated and because the patch provided a hard-coded cache instead of using an interface (which would have been much better and not that difficult, but since my patch wasn't going to be accepted I didn't bother). You can find my patch for twisted.web.client here: http://www.twistedmatrix.com/users/roundup.twistd/twisted/issue612
-- Valentino Volonghi aka Dialtone Now running FreeBSD 5.3-beta6 Blog: http://vvolonghi.blogspot.com Home Page: http://xoomer.virgilio.it/dialtone/
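To make the caching idea concrete, here is a small in-memory sketch of a store that remembers each feed's validators and last body. It is not the code from the patch linked above; the class FeedCache and its method names are hypothetical, and it assumes the caller passes response headers as a dict with lower-case names.

```python
class FeedCache:
    """In-memory store of validators and bodies, keyed by URL."""

    def __init__(self):
        self._entries = {}  # url -> (etag, last_modified, body)

    def request_headers(self, url):
        """Headers to add to the next GET for `url`."""
        etag, last_modified, _ = self._entries.get(url, (None, None, None))
        headers = {}
        if etag:
            headers["If-None-Match"] = etag
        if last_modified:
            headers["If-Modified-Since"] = last_modified
        return headers

    def store(self, url, status, headers, body):
        """Record a response and return the body that should be used.

        On 304 the previously stored body is returned; on 200 the
        entry for `url` is replaced with the new validators and body.
        """
        if status == 304 and url in self._entries:
            return self._entries[url][2]
        if status == 200:
            self._entries[url] = (
                headers.get("etag"),
                headers.get("last-modified"),
                body,
            )
        return body


cache = FeedCache()
cache.store("http://example.org/feed", 200, {"etag": '"v1"'}, b"<rss/>")
print(cache.request_headers("http://example.org/feed"))
print(cache.store("http://example.org/feed", 304, {}, b""))
```

Hiding the dictionary behind two methods like this is one way to get the interface mentioned above: a disk-backed or database-backed cache could replace it without changing the download code.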

On Fri, 05 Nov 2004 10:19:40 +0100, Jacob Friis <lists@debpro.webcom.dk> wrote:
I have a script that downloads multiple rss/atom feeds via Feedparser. The script uses twisted.internet but the developer tells me there is no way to use etag and last-modified with twisted.
I have been working an angle on this as well and have given up for the time being in terms of integrating a Twisted-based 'connector' in a way that urllib2 could use it - which is the best way of doing it. If you do it that way, then the interface is transparent.
Problem is, urllib2 documentation is very confusing. I know, that probably sounds weird on a twisted mailing list, but there you have it. :-)
What I have now is that I use the classes in twisted.web.client to pull the page down, then feed it to feedparser. That means that I have to handle the headers and etag/last-modified stuff myself. But if you look at the code for feedparser, it's not that complicated. I do regret having to duplicate code, but it can't be helped unless Mark expands his interface a little.
And ideally, I'd prefer to pass a twisted connection to urllib2 as a handler anyway.
I'm attaching a small proof of concept for the non-urllib2 implementation I've been playing around with. It's very basic.
Instead I'll let Feedparser do the download and use twisted for threads. What is the maximum pool size I can use?
Screw that. Been there, done that, it sucks. I say again, IT SUCKS. Did I mention it sucks? PC performance seems to degrade exponentially as you fire off more and more feedparser-threads. I've done it. Even with a modest throttle setting of 15 simultaneous connections, my system was chewing itself to bits. Granted this was Win32, but on the other hand I've established many times that many connections through the twisted interface, and seen virtually no indication that anything was going on at all - system was smooth as glass.
So there's the thing. Do a little extra work, and make it work RIGHT, or do a little extra work, and make it a bad user experience.
If you hate your users, select option #2.
-- Regards, Jeff
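The approach described in this message (fetch with twisted.web.client, handle the validators by hand, then pass the body to feedparser) might look roughly like the sketch below. It is not the proof of concept mentioned above; it uses the Agent API from current Twisted releases, which postdates this thread, and the function name fetch_feed and its return shape are assumptions made for illustration. Status codes other than 200 and 304 are not treated specially.

```python
def fetch_feed(agent, url, etag=None, last_modified=None):
    """Conditionally GET `url` with a twisted.web.client.Agent.

    Returns a Deferred firing with (parsed_feed, etag, last_modified).
    parsed_feed is None when the server answered 304 Not Modified.
    `etag` and `last_modified` are bytes saved from the previous fetch.
    """
    # Lazy imports: Twisted and feedparser are third-party packages.
    import feedparser
    from twisted.web.client import readBody
    from twisted.web.http_headers import Headers

    headers = Headers()
    if etag:
        headers.addRawHeader(b"If-None-Match", etag)
    if last_modified:
        headers.addRawHeader(b"If-Modified-Since", last_modified)

    def first(values):
        # getRawHeaders returns a list of values, or None if absent.
        return values[0] if values else None

    def got_response(response):
        if response.code == 304:
            # Unchanged: keep the old validators, nothing to parse.
            return None, etag, last_modified
        new_etag = first(response.headers.getRawHeaders(b"etag"))
        new_modified = first(response.headers.getRawHeaders(b"last-modified"))
        d = readBody(response)
        # feedparser.parse accepts the raw document as well as a URL.
        d.addCallback(lambda body: (feedparser.parse(body), new_etag, new_modified))
        return d

    d = agent.request(b"GET", url.encode("ascii"), headers)
    return d.addCallback(got_response)
```

The caller keeps the returned validators next to the feed URL and passes them back on the next poll. Note that feedparser.parse runs in the reactor thread here, so a very large feed would briefly block other downloads.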

Might help to attach the file, sparky :-)
On Fri, 5 Nov 2004 10:19:03 -0500, Jeff Grimmett <grimmtooth@gmail.com> wrote:
On Fri, 05 Nov 2004 10:19:40 +0100, Jacob Friis <lists@debpro.webcom.dk> wrote:
I have a script that downloads multiple rss/atom feeds via Feedparser. The script uses twisted.internet but the developer tells me there is no way to use etag and last-modified with twisted.
I have been working an angle on this as well and have given up for the time being in terms of integrating a Twisted-based 'connector' in a way that urllib2 could use it - which is the best way of doing it. If you do it that way, then the interface is transparent.
Problem is, urllib2 documentation is very confusing. I know, that probably sounds weird on a twisted mailing list, but there you have it. :-)
What I have now is that I use the classes in twisted.web.client to pull the page down, then feed it to feedparser. That means that I have to handle the headers and etag/last-modified stuff myself. But if you look at the code for feedparser, it's not that complicated. I do regret having to duplicate code, but it can't be helped unless Mark expands his interface a little.
And ideally, I'd prefer to pass a twisted connection to urllib2 as a handler anyway.
I'm attaching a small proof of concept for the non-urllib2 implementation I've been playing around with. It's very basic.
Instead I'll let Feedparser do the download and use twisted for threads. What is the maximum pool size I can use?
Screw that. Been there, done that, it sucks. I say again, IT SUCKS. Did I mention it sucks? PC performance seems to degrade exponentially as you fire off more and more feedparser-threads. I've done it. Even with a modest throttle setting of 15 simultaneous connections, my system was chewing itself to bits. Granted this was Win32, but on the other hand I've established many times that many connections through the twisted interface, and seen virtually no indication that anything was going on at all - system was smooth as glass.
So there's the thing. Do a little extra work, and make it work RIGHT, or do a little extra work, and make it a bad user experience.
If you hate your users, select option #2.
-- Regards,
Jeff
participants (4): exarkun@divmod.com, Jacob Friis, Jeff Grimmett, Valentino Volonghi