[Twisted-Python] Scalability of an rss-aggregator
Hi all, attached you will find my rss-aggregator made with twisted. It's really fast, although when I tried with 745 feeds I got some problems. When the download reached 300 parsed feeds (more or less) it locked till I pressed Ctrl+C, and then it processed the remaining 340 feeds in less than 30 seconds... I think that my design has at least one issue, but I cannot find it so easily and I hope someone on this list can help me improve it. The "script" is heavily commented.

BTW when it finishes (with all 740 feeds) it reports an awesome 330 seconds, which is an impressive time, less than half a second for each feed, and it downloads more than 50Mb of feeds from the net (with 745 feeds to download).

Thx for your help.

--
Valentino Volonghi aka Dialtone
Linux User #310274, Gentoo Proud User
X Python Newsreader developer
http://sourceforge.net/projects/xpn/

from twisted.internet import reactor, protocol, defer
from twisted.web import client
import feedparser, time, out

rss_feeds = out.rss_feed

# This is the default site list
#rss_feeds = [('http://www.nongnu.org/straw/news.rss','straw'),
#             ('http://googlenews.74d.com/rss/google_it.rss','google'),
#             ('http://www.pythonware.com/daily/rss.xml','pythonware'),
#             ('http://www.theinquirer.net/inquirer.rss','inq'),
#             ('http://www.groklaw.net/backend/GrokLaw.rdf','grok'),
#             ('http://www.livejournal.com/users/moshez/data/rss','zadka'),
#             ('http://www.pythonware.com/news.rdf','pwn')]
# michele@berthold.com

INTER_QUERY_TIME = 300

class FeederProtocol(object):
    def __init__(self):
        self.parsed = 0
        # This dict structure will be the following:
        # { 'URL': (TIMESTAMP, value) }
        self.cache = {}

    def gotError(self, data=None, extra_args=None):
        # An error has occurred, print traceback info and go on
        print data
        self.parsed += 1
        print "="*20
        print "Trying to go on..."

    def getFeeds(self, where=None):
        #print "getting feeds"
        # This is to get the feeds we want
        if not where:
            # We don't have a database, so we use the local
            # variable rss_feeds
            return rss_feeds
        else:
            return None

    def memoize(self, feed, site=None, extra=None):
        # extra is the second element of each tuple of rss_feeds
        # site is the address of the feed, also the first element of each
        # tuple of rss_feeds
        print "Memoizing", site, "..."
        self.cache.setdefault(site, (time.time(), feed))
        return feed

    def stopWorking(self, data=None):
        print "Closing connection number %d..." % (self.parsed,)
        print "-"*20
        # This is here only for testing. When a protocol/interface is
        # created to communicate with this rss-aggregator server, we won't
        # need to die after we have parsed the feeds just one time.
        self.parsed += 1
        if self.parsed >= len(rss_feeds):
            print "Closing all..."
            #for i in self.cache:
            #    print i
            print time.time() - tp
            #reactor.stop()

    def getPageFromMemory(self, key=None):
        #print "getting from memory"
        # Get the second element of the tuple, which is the parsed
        # structure of the feed at address key; the first element of the
        # tuple is the timestamp
        d = defer.succeed(self.cache.get(key, key)[1])
        return d

    def parseFeed(self, feed):
        # This is self-explanatory :)
        return feedparser.parse(feed)

    def startDownloading(self, site):
        #print "Looking if", site[0], "cached...",
        # Try to get the tuple (TIMESTAMP, FEED_STRUCT) from the dict if it
        # has already been downloaded, otherwise assign None to already_got
        already_got = self.cache.get(site[0], None)

        # Ok guys, we got it cached, let's see what we will do
        if already_got:
            # Well, it's cached, but will it be recent enough?
            #print "It is\n Looking if timestamp for", site[0], "is recent enough...",
            elapsed_time = time.time() - already_got[0]

            # Woooohooo, elapsed_time is less than INTER_QUERY_TIME, so I
            # can get the page from memory, it is recent enough
            if elapsed_time < INTER_QUERY_TIME:
                #print "It is"
                return self.getPageFromMemory(site[0])
            else:
                # Uhmmm... actually it's a bit old, so I'm going to get it
                # from the Net, then I'll parse it and then I'll try to
                # memoize it again
                #print "Getting", site[0], "from the Net because old"
                return self.downloadPage(site)
        else:
            # Well... we didn't have it cached, so we need to get it from
            # the Net now; it's useless to check if it's recent enough,
            # it's simply not there.
            #print "Getting", site[0], "from the Net"
            return self.downloadPage(site)

    def downloadPage(self, site):
        #print "Now downloading..."
        # Self-explanatory
        d = client.getPage(site[0])
        # Uncomment the following if you want to make everything crash :),
        # since it will save the feed to a file, but with the memoize
        # feature it will break the get-->parse-->memoize chain
        #d = client.downloadPage(site[0], site[1])

        # Parse the feed and if there are errors call self.gotError
        d.addCallbacks(self.parseFeed, self.gotError)
        # Now memoize it; if there's some error call self.gotError
        d.addCallbacks(self.memoize, self.gotError, site)
        return d

    def workOnPage(self, parsed_feed=None, site=None, extra_args=None, extra_key=None):
        print "-"*20
        #print "finished retrieving"
        print "Feed Version:", parsed_feed.get('version', 'Unknown')
        #
        # Uncomment the following if you want to print the feeds
        #
        chan = parsed_feed.get('channel', None)
        if chan:
            print chan.get('title', '')
            #print chan.get('link', '')
            #print chan.get('tagline', '')
            #print chan.get('description','')
        print "-"*20
        #items = parsed_feed.get('items', None)
        #if items:
        #    for item in items:
        #        print '\tTitle: ', item.get('title','')
        #        print '\tDate: ', item.get('date', '')
        #        print '\tLink: ', item.get('link', '')
        #        print '\tDescription: ', item.get('description', '')
        #        print '\tSummary: ', item.get('summary','')
        #        print "-"*20
        #print "got", site
        #print "="*40

    def start(self, data=None):
        # Here we gather all the urls for the feeds
        #self.factory.tries += 1
        for feed in self.getFeeds():
            # Now we start telling the reactor that it has
            # to get all the feeds one by one...
            d = self.startDownloading(feed)
            # Then it will pass the result of startDownloading to
            # workOnPage (this is hidden in twisted) together with the
            # feed url (just to use some extra info in the workOnPage
            # method)
            d.addCallbacks(self.workOnPage, self.gotError, feed)
            # We also put stopWorking on the callback chain of every feed,
            # so that the last one gathered can report and shut things down
            d.addCallbacks(self.stopWorking, self.gotError)
            # This is to try the memoize feature
            #if self.factory.tries < 3:
            #    d.addCallback(self.start)


class FeederFactory(protocol.ClientFactory):
    protocol = FeederProtocol()

    def __init__(self):
        # tries is used to make more connections to exercise the
        # memoizing feature
        #self.tries = 0
        # Here we give the FeederProtocol instance a reference to
        # FeederFactory under the name self.factory (as seen from the
        # protocol)
        self.protocol.factory = self
        self.protocol.start()


f = FeederFactory()
tp = time.time()
reactor.run()
On Wed, Mar 31, 2004 at 09:33:58AM +0200, Valentino Volonghi aka Dialtone wrote:
Hi all, attached you will find my rss-aggregator made with twisted.
It's really fast although when I tried with 745 feeds I got some problems. When the download reached 300 parsed feeds (more or less) it locked till I pressed Ctrl+C and then it processed the remaining 340 feeds in less than 30 seconds... I think that my design has at least an issue but I cannot find it so easily and I hope someone on this list can help me to improve it.
By default, Twisted uses the platform name resolver, which is blocking. Perhaps a non-existent domain is causing gethostbyname to block? You should be able to test this theory by installing Twisted's resolver:

from twisted.names import client
reactor.installResolver(client.createResolver())

client.createResolver makes a reasonable effort to use your system's DNS configuration (by looking at /etc/resolv.conf on posix systems, for example), so it should work without any special arguments.
The "script" is heavily commented.
BTW When it finishes (with all 740 feeds) it reports an awesome 330 seconds which is an impressive time, less than half a second for each feed, and It downloads more than 50Mb of feeds from the net (with 745 feeds to download).
Nice!
Thx for your help.
Not a problem. Let us know if it helps. -Andrew.
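For illustration, a minimal sketch of where the suggested resolver installation would sit in the aggregator script posted above; the only assumption here is the placement, i.e. before the downloads are kicked off:

from twisted.internet import reactor
from twisted.names import client

# Install Twisted's non-blocking DNS resolver before anything starts
# connecting, i.e. before FeederFactory() schedules the downloads and
# reactor.run() is called in the script above.
reactor.installResolver(client.createResolver())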
Andrew Bennetts wrote:
On Wed, Mar 31, 2004 at 09:33:58AM +0200, Valentino Volonghi aka Dialtone wrote:
Hi all, attached you will find my rss-aggregator made with twisted.
It's really fast although when I tried with 745 feeds I got some problems. When the download reached 300 parsed feeds (more or less) it locked till I pressed Ctrl+C and then it processed the remaining 340 feeds in less than 30 seconds... I think that my design has at least an issue but I cannot find it so easily and I hope someone on this list can help me to improve it.
By default, Twisted uses the platform name resolver, which is blocking. Perhaps a non-existent domain is causing gethostbyname to block?
Uhmm... dunno, but I tried to remove the 'locking' feed-source and it didn't change.
You should be able to test this theory by installing Twisted's resolver:
from twisted.names import client
reactor.installResolver(client.createResolver())

client.createResolver makes a reasonable effort to use your system's DNS configuration (by looking at /etc/resolv.conf on posix systems, for example), so it should work without any special arguments.
ok, it changes into a totally non-working script :) I get a lot of these:

[Failure instance: Traceback: exceptions.TypeError, unsubscriptable object
/usr/lib/python2.3/site-packages/twisted/internet/defer.py:313:_runCallbacks
/usr/lib/python2.3/site-packages/twisted/names/resolve.py:44:__call__
/usr/lib/python2.3/site-packages/twisted/names/common.py:36:query
/usr/lib/python2.3/site-packages/twisted/names/common.py:104:lookupAllRecords
/usr/lib/python2.3/site-packages/twisted/names/client.py:266:_lookup
/usr/lib/python2.3/site-packages/twisted/names/client.py:214:queryUDP
]
BTW When it finishes (with all 740 feeds) it reports an awesome 330 seconds which is an impressive time, less than half a second for each feed, and It downloads more than 50Mb of feeds from the net (with 745 feeds to download).
Nice!
Yup, I was going to ask the Straw developers to use my script instead of asyncore. Straw has a lot of problems with 200 feeds, e.g. it resets the connection and such. This would be an awesome improvement.

--
Valentino Volonghi aka Dialtone
Linux User #310274, Gentoo Proud User
X Python Newsreader developer
http://sourceforge.net/projects/xpn/
On Wed, Mar 31, 2004 at 01:27:49PM +0200, Valentino Volonghi aka Dialtone wrote:
Andrew Bennetts wrote:
On Wed, Mar 31, 2004 at 09:33:58AM +0200, Valentino Volonghi aka Dialtone wrote:
Hi all, attached you will find my rss-aggregator made with twisted.
It's really fast although when I tried with 745 feeds I got some problems. When the download reached 300 parsed feeds (more or less) it locked till I pressed Ctrl+C and then it processed the remaining 340 feeds in less than 30 seconds... I think that my design has at least an issue but I cannot find it so easily and I hope someone on this list can help me to improve it.
By default, Twisted uses the platform name resolver, which is blocking. Perhaps a non-existent domain is causing gethostbyname to block?
Uhmm... dunno, but I tried to remove the 'locking' feed-source and it didn't change.
Hmm, it's unlikely to be DNS lookups causing it, then. We need some way to narrow down where it's happening. There are a few options I can think of, but they're all a bit heavyweight:

- Use strace to get some idea what it's doing
- Use the --spew option of twistd (or manually install the spewer with "from twisted.python.util import spewer; sys.settrace(spewer)")
- Use gdb to attach to the process, then look at the backtrace there. (You can apparently get the Python backtrace in gdb by putting this macro in your .gdbinit:

define ppystack
  while $pc < Py_Main || $pc > Py_GetArgcArgv
    if $pc > eval_frame && $pc < PyEval_EvalCodeEx
      set $__fn = PyString_AsString(co->co_filename)
      set $__n = PyString_AsString(co->co_name)
      printf "%s (%d): %s\n", $__fn, f->f_lineno, $__n
    end
    up-silently 1
  end
  select-frame 0
end

But I've never tried this...)

Is it possible that feedparser is hanging while trying to parse that feed? Perhaps try putting print statements before and after the feedparser.parse call.
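As an illustration of that last suggestion, here is a sketch of an instrumented parseFeed, meant as a drop-in replacement for the method in the attached FeederProtocol class; the timing variable and the exact messages are additions, not part of the original code (the strace output later in the thread suggests Valentino added similar "parsing..." prints):

import time
import feedparser

class FeederProtocol(object):
    # ... rest of the class as in the attached script ...

    def parseFeed(self, feed):
        # Print before and after the parse so a hang (or a slow synchronous
        # fetch) inside feedparser shows up clearly in the output.
        start = time.time()
        print "parsing..."
        parsed = feedparser.parse(feed)
        print "parsed feed in %.2f seconds" % (time.time() - start)
        return parsed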
You should be able to test this theory by installing Twisted's resolver:
from twisted.names import client
reactor.installResolver(client.createResolver())

client.createResolver makes a reasonable effort to use your system's DNS configuration (by looking at /etc/resolv.conf on posix systems, for example), so it should work without any special arguments.
ok, it changes into a totally non-working script :)
I get a lot of these:

[Failure instance: Traceback: exceptions.TypeError, unsubscriptable object
/usr/lib/python2.3/site-packages/twisted/internet/defer.py:313:_runCallbacks
/usr/lib/python2.3/site-packages/twisted/names/resolve.py:44:__call__
/usr/lib/python2.3/site-packages/twisted/names/common.py:36:query
/usr/lib/python2.3/site-packages/twisted/names/common.py:104:lookupAllRecords
/usr/lib/python2.3/site-packages/twisted/names/client.py:266:_lookup
/usr/lib/python2.3/site-packages/twisted/names/client.py:214:queryUDP
]
Ouch. I wonder how that bug crept in? The twisted.names code is expecting a sequence of timeouts (to re-issue the query with, until failing at last), but twisted.internet is only giving it a single integer. I've filed a bug report for this: http://twistedmatrix.com/bugs/issue570, if you care :)
BTW When it finishes (with all 740 feeds) it reports an awesome 330 seconds which is an impressive time, less than half a second for each feed, and It downloads more than 50Mb of feeds from the net (with 745 feeds to download).
Nice!
Yup, I was going to ask the Straw developers to use my script instead of asyncore. Straw has a lot of problems with 200 feeds, e.g. it resets the connection and such. This would be an awesome improvement.
Absolutely. I've heard similar complaints about straw, and I've been hoping some keen person would apply Twisted to fix the problem :) -Andrew.
Andrew Bennetts wrote:
Hmm, it's unlikely to be DNS lookups causing it, then.
We need some way to narrow down where it's happening. There are a few options I can think of, but they're all a bit heavyweight...
- Use strace to get some idea what it's doing
- Use the --spew option of twistd (or manually install the spewer with "from twisted.python.util import spewer; sys.settrace(spewer)")
- Use gdb to attach to the process, then look at the backtrace there.
(You can apparently get the python backtrace in gdb by putting this macro in your .gdbinit:
define ppystack
  while $pc < Py_Main || $pc > Py_GetArgcArgv
    if $pc > eval_frame && $pc < PyEval_EvalCodeEx
      set $__fn = PyString_AsString(co->co_filename)
      set $__n = PyString_AsString(co->co_name)
      printf "%s (%d): %s\n", $__fn, f->f_lineno, $__n
    end
    up-silently 1
  end
  select-frame 0
end
But I've never tried this...)
Is it possible that feedparser is hanging while trying to parse that feed? Perhaps try putting print statements before and after the feedparser.parse call.
Maybe the problem is there, but then that wouldn't answer the other question: "Why does it take at most 30 seconds to parse all the remaining 350 feeds?" There is no network activity after the unlocking "Ctrl+C"... Gotta investigate, then.
You should be able to test this theory by installing Twisted's resolver:
from twisted.names import client
reactor.installResolver(client.createResolver())

client.createResolver makes a reasonable effort to use your system's DNS configuration (by looking at /etc/resolv.conf on posix systems, for example), so it should work without any special arguments.
ok, it changes into a totally non-working script :)
I get a lot of these:

[Failure instance: Traceback: exceptions.TypeError, unsubscriptable object
/usr/lib/python2.3/site-packages/twisted/internet/defer.py:313:_runCallbacks
/usr/lib/python2.3/site-packages/twisted/names/resolve.py:44:__call__
/usr/lib/python2.3/site-packages/twisted/names/common.py:36:query
/usr/lib/python2.3/site-packages/twisted/names/common.py:104:lookupAllRecords
/usr/lib/python2.3/site-packages/twisted/names/client.py:266:_lookup
/usr/lib/python2.3/site-packages/twisted/names/client.py:214:queryUDP
]
Ouch. I wonder how that bug crept in? The twisted.names code is expecting a sequence of timeouts (to re-issue the query with, until failing at last), but twisted.internet is only giving it a single integer. I've filed a bug report for this: http://twistedmatrix.com/bugs/issue570, if you care :)
Sure :). This is the second bug for me; the first one was a documentation bug, since the finger tutorial has some errors :).
Absolutely. I've heard similar complaints about straw, and I've been hoping some keen person would apply Twisted to fix the problem :)
That was my hope too, but since a friend of mine asked for an rss-aggregator made with twisted... I realized that someone wants me to be that keen person. Oooohhhh, what does fate have in store for me? Ooooooohhhhh :P

--
Valentino Volonghi aka Dialtone
Linux User #310274, Gentoo Proud User
X Python Newsreader developer
http://sourceforge.net/projects/xpn/
Valentino Volonghi aka Dialtone wrote:
Maybe the problem is there, but then that wouldn't answer the other question: "Why does it take at most 30 seconds to parse all the remaining 350 feeds?" There is no network activity after the unlocking "Ctrl+C"... Gotta investigate, then.
Now... the problem is not a parsing problem. I just made one last test (and am performing another one right now with the spewer). The code locked at the 404th feed downloaded, without anything running (no parsing and no memoizing). There is no network activity just after the 404th. Up to the 404th it's not very fast; after the lock (and my Ctrl+C) it goes at light speed till the 758th feed (and this can be because it's waiting for the feeds to get downloaded, which happens at about the 400th).

Ok, now the "debug" version has just stopped, here is the last output:

function callWithLogger in /usr/lib/python2.3/site-packages/twisted/python/log.py, line 54
method logPrefix of twisted.internet.tcp.Client at 1085149132
function callWithContext in /usr/lib/python2.3/site-packages/twisted/python/log.py, line 49
method getContext of twisted.python.context.ContextTracker at 1081585292
method callWithContext of twisted.python.context.ContextTracker at 1081585292
method _doReadOrWrite of twisted.internet.default.SelectReactor at 1077521228
method doRead of twisted.internet.tcp.Client at 1085149132
method fileno of socket._socketobject at 1085157164
method removeReader of twisted.internet.default.SelectReactor at 1077521228
method removeWriter of twisted.internet.default.SelectReactor at 1077521228
method connectionLost of twisted.internet.tcp.Client at 1085149132
method connectionLost of twisted.internet.tcp.Client at 1085149132
method connectionLost of twisted.internet.tcp.Client at 1085149132
method _closeSocket of twisted.internet.tcp.Client at 1085149132
method shutdown of socket._socketobject at 1085157164
method connectionLost of twisted.web.client.HTTPPageGetter at 1086481388
method connectionLost of twisted.web.client.HTTPPageGetter at 1086481388
method handleResponseEnd of twisted.web.client.HTTPPageGetter at 1086481388
method noPage of twisted.web.client.HTTPClientFactory at 1085148876
method connectionLost of twisted.internet.tcp.Connector at 1085149100
method clientConnectionLost of twisted.web.client.HTTPClientFactory at 1085148876
method doStop of twisted.web.client.HTTPClientFactory at 1085148876
method __repr__ of twisted.web.client.HTTPClientFactory at 1085148876
method msg of twisted.python.log.LogPublisher at 1081585612
method getContext of twisted.python.context.ContextTracker at 1081585292
method _emit of twisted.python.log.DefaultObserver at 1081585644
method stopFactory of twisted.web.client.HTTPClientFactory at 1085148876
method runUntilCurrent of twisted.internet.default.SelectReactor at 1077521228
method timeout of twisted.internet.default.SelectReactor at 1077521228
method doSelect of twisted.internet.default.SelectReactor at 1077521228
method fileno of socket._socketobject at 1086221044
method fileno of socket._socketobject at 1086589444
method fileno of socket._socketobject at 1085552996
method fileno of socket._socketobject at 1086453556
[LOTS OF THESE]
method fileno of socket._socketobject at 1086477372
method fileno of socket._socketobject at 1086623612
method fileno of socket._socketobject at 1085203316
method fileno of socket._socketobject at 1085733820

##@#@#@ Ctrl+C

And now it goes on till the end at warp speed.

--
Valentino Volonghi aka Dialtone
Linux User #310274, Gentoo Proud User
X Python Newsreader developer
http://sourceforge.net/projects/xpn/
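For reference, the trace above is the kind of output the spewer produces; a minimal sketch of how it would have been installed, following Andrew's earlier suggestion (placing it at the top of the script, before reactor.run(), is an assumption):

import sys
from twisted.python.util import spewer

# Trace every function and method call the process makes; this is what
# produces the "method ... of ... at ..." lines shown above.
sys.settrace(spewer)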
Valentino Volonghi aka Dialtone wrote:
The code locked at the 404th feed downloaded, without anything running (no parsing and no memoizing).

There is no network activity just after the 404th. Up to the 404th it's not very fast; after the lock (and my Ctrl+C) it goes at light speed till the 758th feed (and this can be because it's waiting for the feeds to get downloaded, which happens at about the 400th).
I went on testing and found some interesting things... I also tried with just 36 feeds and it locked in the same way; then I tested again with that feed and everything worked fine. So maybe it's the resolver that locks. I tried the solution that Andrew posted on the issue tracker, but I got a lot of these:

[Failure instance: Traceback: twisted.internet.defer.TimeoutError, [Query('www.ozzie.net', 255, 1)]
]
====================
Trying to go on...
parsing...
[Failure instance: Traceback: exceptions.AttributeError, 'NoneType' object has no attribute 'find'
/usr/lib/python2.3/site-packages/twisted/internet/defer.py:313:_runCallbacks
twisted-rss.py:107:parseFeed
/home/dialtone/programmi_didattici/rss-aggregator/feedparser.py:1679:parse
/home/dialtone/programmi_didattici/rss-aggregator/feedparser.py:1289:_open_resource
/usr/lib/python2.3/urlparse.py:49:urlparse
/usr/lib/python2.3/urlparse.py:79:urlsplit
]
====================
Trying to go on...
Memoizing http://www.ozzie.net/blog/rss.xml ...
--------------------
finished retrieving
Feed Version:
[Failure instance: Traceback: exceptions.AttributeError, 'NoneType' object has no attribute 'get'
/usr/lib/python2.3/site-packages/twisted/internet/defer.py:313:_runCallbacks
twisted-rss.py:123:workOnPage
]
====================
Trying to go on...
Closing connection number 720...
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Obviously the error is the first one: without a feed to parse, the whole chain fails.

--
Valentino Volonghi aka Dialtone
Linux User #310274, Gentoo Proud User
X Python Newsreader developer
http://sourceforge.net/projects/xpn/
On Thu, Apr 01, 2004 at 12:13:03PM +0200, Valentino Volonghi aka Dialtone wrote:
I went on testing and found some interesting things...
I just found something interesting too -- I took a quick peek at feedparser, and the parse function looks like it fetches the page synchronously using urllib. That's *not* a good thing to do from inside Twisted's main loop. It looks like you want your parseFeed method to call:

r = FeedParser(baseuri)
r.feed(data)

like feedparser.parse does internally. (You'll need to do a little bit of work to return the same sort of dictionary that parse constructs for you.)

-Andrew.
Andrew Bennetts wrote:
On Thu, Apr 01, 2004 at 12:13:03PM +0200, Valentino Volonghi aka Dialtone wrote:
I went on testing and found some interesting things...
I just found something interesting too -- I just took a quick peek at feedparser, and the parse function looks like it fetches the page synchronously using urllib. That's *not* a good thing to do from inside Twisted's main loop. It looks like you want your parseFeed method to call:
r = FeedParser(baseuri)
r.feed(data)
like feedparser.parse does internally. (You'll need to do a little bit of work to return the same sort of dictionary that parse constructs for you.)
I verified... The parser only downloads if a URL is supplied, otherwise it does not. Anyway, since it needs a StringIO-like argument, I now do the conversion into a StringIO myself, and this makes the parser always return as fast as possible from _open_resource(). I'm starting to think that this is some kind of 'race condition' inside twisted, or something similar. BTW, I'll investigate more and more :)

--
Valentino Volonghi aka Dialtone
Linux User #310274, Gentoo Proud User
X Python Newsreader developer
http://sourceforge.net/projects/xpn/
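For illustration, a minimal sketch of the change described above, assuming parseFeed still receives the raw page body that client.getPage() hands to the callback chain; the method is meant as a drop-in for the one in the attached script, and the parameter name feed_body is mine:

import cStringIO as StringIO
import feedparser

class FeederProtocol(object):
    # ... rest of the class as in the attached script ...

    def parseFeed(self, feed_body):
        # feed_body is the page already downloaded by client.getPage(), so
        # hand feedparser a file-like object instead of a URL; that way
        # _open_resource() returns immediately and never goes back to the
        # network (and never blocks the reactor on urllib).
        return feedparser.parse(StringIO.StringIO(feed_body))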
On Thu, Apr 01, 2004 at 04:03:07PM +0200, Valentino Volonghi aka Dialtone wrote:
I'm starting to think that this is some kind of 'race condition' inside twisted, or something similar. BTW, I'll investigate more and more :)
But the traceback you posted from the deferred chain was from the guts of feedparser... -Andrew.
Andrew Bennetts wrote:
But the traceback you posted from the deferred chain was from the guts of feedparser...
Looking at the strace output it seems more Twisted-related... Here is the output:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
gettimeofday({1080889267, 866232}, NULL) = 0
close(313) = 0
gettimeofday({1080889267, 866410}, NULL) = 0
select(726, [4 13 17 18 23 26 28 31 39 40 42 44 45 47 49 50 51 52 55 58 60 63 66 67 69 73 74 77 78 80 81 82 86 89 92 98 99 101 104 108 109 110 111 112 115 116 122 124 127 128 130 131 132 139 142 145 147 149 153 157 158 162 164 171 172 173 176 178 180 182 183 185 187 188 190 191 192 193 194 195 196 225 226 227 228 230 232 234 237 238 241 246 248 249 250 251 253 254 256 257 258 260 262 263 264 267 268 269 271 273 274 275 277 278 281 282 284 287 294 296 297 298 299 301 303 304 305 306 309 310 312 314 316 317 321 322 325 327 329 330 331 332 334 335 336 339 340 343 345 346 348 349 350 351 356 359 361 364 365 366 367 368 371 372 373 374 375 378 379 380 383 385 386 387 390 393 396 397 399 400 401 405 406 407 408 410 414 415 417 422 427 428 430 431 432 436 438 439 440 442 443 445 447 448 450 451 452 453 455 456 458 459 462 464 467 468 476 477 479 480 481 482 486 488 489 493 494 496 497 498 503 505 506 507 508 510 511 513 514 515 517 520 523 525 527 528 529 534 535 537 538 539 544 547 550 553 554 556 558 559 560 561 565 566 569 572 574 575 577 579 580 586 587 588 589 590 592 593 594 597 598 600 601 602 603 604 605 606 609 610 611 612 613 615 620 622 623 628 630 631 633 634 635 637 640 643 645 651 654 655 659 660 664 665 666 667 670 671 675 676 677 678 681 683 684 687 715 725], [], [], NULL) = 1 (in [447])
recv(447, 0xa3cc9e4, 65536, 0) = -1 ECONNRESET (Connection reset by peer)
shutdown(447, 2 /* send and receive */) = -1 ENOTCONN (Transport endpoint is not connected)
write(1, "parsing...\n", 11parsing...
) = 11
futex(0x8067858, FUTEX_WAKE, 1) = 0
futex(0x8067858, FUTEX_WAKE, 1) = 0
futex(0x8067858, FUTEX_WAKE, 1) = 0
write(1, "parsed feed\n", 12parsed feed
) = 12
write(1, "Memoizing http://weblogs.asp.net"..., 53Memoizing http://weblogs.asp.net/JohanL/rss.aspx ...
) = 53
gettimeofday({1080889267, 968545}, NULL) = 0
write(1, "--------------------\n", 21--------------------
) = 21
write(1, "finished retrieving\n", 20finished retrieving
) = 20
write(1, "Feed Version: \n", 15Feed Version:
) = 15
write(1, "--------------------\n", 21--------------------
) = 21
write(1, "Closing connection number 404..."..., 33Closing connection number 404...
) = 33
write(1, "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"..., 41=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
) = 41
write(1, "405 730\n", 8405 730
) = 8
gettimeofday({1080889267, 974236}, NULL) = 0
close(447) = 0
gettimeofday({1080889267, 974428}, NULL) = 0
select(726, [4 13 17 18 23 26 28 31 39 40 42 44 45 47 49 50 51 52 55 58 60 63 66 67 69 73 74 77 78 80 81 82 86 89 92 98 99 101 104 108 109 110 111 112 115 116 122 124 127 128 130 131 132 139 142 145 147 149 153 157 158 162 164 171 172 173 176 178 180 182 183 185 187 188 190 191 192 193 194 195 196 225 226 227 228 230 232 234 237 238 241 246 248 249 250 251 253 254 256 257 258 260 262 263 264 267 268 269 271 273 274 275 277 278 281 282 284 287 294 296 297 298 299 301 303 304 305 306 309 310 312 314 316 317 321 322 325 327 329 330 331 332 334 335 336 339 340 343 345 346 348 349 350 351 356 359 361 364 365 366 367 368 371 372 373 374 375 378 379 380 383 385 386 387 390 393 396 397 399 400 401 405 406 407 408 410 414 415 417 422 427 428 430 431 432 436 438 439 440 442 443 445 448 450 451 452 453 455 456 458 459 462 464 467 468 476 477 479 480 481 482 486 488 489 493 494 496 497 498 503 505 506 507 508 510 511 513 514 515 517 520 523 525 527 528 529 534 535 537 538 539 544 547 550 553 554 556 558 559 560 561 565 566 569 572 574 575 577 579 580 586 587 588 589 590 592 593 594 597 598 600 601 602 603 604 605 606 609 610 611 612 613 615 620 622 623 628 630 631 633 634 635 637 640 643 645 651 654 655 659 660 664 665 666 667 670 671 675 676 677 678 681 683 684 687 715 725], [], [], NULL
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Here it locks, but now, since I'm using strace, it won't restart after Ctrl+C. As I said before... the download of _ALL_ feeds has already finished when Twisted locks.

--
Valentino Volonghi aka Dialtone
Linux User #310274, Gentoo Proud User
X Python Newsreader developer
http://sourceforge.net/projects/xpn/
On Fri, Apr 02, 2004 at 09:12:41AM +0200, Valentino Volonghi aka Dialtone wrote:
Andrew Bennetts wrote:
But the traceback you posted from the deferred chain was from the guts of feedparser...
Looking at strace output it seems more like Twisted-related... Here is the output:
[...]
select(726, [4 13 17 18 23 26 28 31 39 40 42 44 45 47 49 50 51 52 55 58 [...] 665 666 667 670 671 675 676 677 678 681 683 684 687 715 725], [], [], NULL
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Here it locks, but now, since I'm using strace, it won't restart after Ctrl+C.

As I said before... the download of _ALL_ feeds has already finished when Twisted locks.
But then why is there still a huge number of file descriptors in the select call? Something is definitely very odd... :/ -Andrew.
Andrew Bennetts wrote:
665 666 667 670 671 675 676 677 678 681 683 684 687 715 725], [], [], NULL
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Here it locks, but now, since I'm using strace, it won't restart after Ctrl+C.

As I said before... the download of _ALL_ feeds has already finished when Twisted locks.
But then why is there still a huge number of file descriptors in the select call? Something is definitely very odd... :/
Well, I think that's because there are more deferreds still to be called, or because even though the pages had been downloaded the reactor was busy with other feeds and not checking the select... I'm not as expert in Twisted's internals as you probably are. Thanks for your help, I hope to solve this problem...

--
Valentino Volonghi aka Dialtone
Linux User #310274, Gentoo Proud User
X Python Newsreader developer
http://sourceforge.net/projects/xpn/
Andrew Bennetts wrote:

Oh... I forgot to say: a full version of the aggregator is ready for download at http://xoomer.virgilio.it/dialtone/rss-aggregator.tar.bz2 in case you want to try the full download yourself, to see what's going on.

--
Valentino Volonghi aka Dialtone
Linux User #310274, Gentoo Proud User
X Python Newsreader developer
http://sourceforge.net/projects/xpn/
Andrew Bennetts wrote:
But then why is there still a huge number of file descriptors in the select call? Something is definitely very odd... :/
I made a very small (30-line) script that reproduces the error; it is attached. The out file is located here: http://xoomer.virgilio.it/dialtone/out.py  It only contains addresses; it's a single list of 730 addresses.

--
Valentino Volonghi aka Dialtone
Linux User #310274, Gentoo Proud User
X Python Newsreader developer
http://sourceforge.net/projects/xpn/

from twisted.internet import reactor, protocol, defer
from twisted.web import client
import out
import cStringIO as _StringIO

#
# The problem is at about the 400th (more or less) body downloaded,
# since Twisted locks and I need to press Ctrl+C to unlock it.
# Using strace you can see that at the moment of locking, although
# there is no download in progress, there are over 300 sockets still
# watched in the main select.
# Looking with spewer you can see that it locks when closing a socket.
#

NUM = 0

def printer(data, args=None):
    global NUM
    print 'got data', NUM
    return data

def transf(data, args=None):
    transfd_data = _StringIO.StringIO(str(data))
    return transfd_data

def gotError(data, args=None):
    global NUM
    print 'got error'
    return

def ender(data, args=None):
    global NUM
    NUM += 1
    if NUM > len(out.rss_feed):
        reactor.stop()

def main():
    for i in out.rss_feed:
        d = client.getPage(i[0])
        d.addCallback(printer)
        d.addErrback(gotError)
        d.addCallback(transf)
        d.addErrback(gotError)
        d.addCallback(ender)
        d.addErrback(gotError)
    print "finished setting all deferreds"

main()
reactor.run()
Hi Valentino,
I made a very small (30-line) script that reproduces the error; it is attached.

The out file is located here: http://xoomer.virgilio.it/dialtone/out.py  It only contains addresses; it's a single list of 730 addresses.
Just tested twice: first time I had to press Ctrl+C on the 718th feed, second time I had to press Ctrl+C on the 724th feed. Regards, Matteo
Matteo Giacomazzi wrote:
Hi Valentino,
Hi Matteo,
Just tested twice: first time I had to press Ctrl+C on the 718th feed, second time I had to press Ctrl+C on the 724th feed.
Ok then, at least I'm not the only one with this strange behaviour. I hope that itamar and exarkun are running the script too (I'm talking to them on irc)... It seems to be a bug in twisted.web.client.getPage() while closing the connection.

--
Valentino Volonghi aka Dialtone
Linux User #310274, Gentoo Proud User
X Python Newsreader developer
http://sourceforge.net/projects/xpn/
Matteo Giacomazzi wrote:

http://www.twistedmatrix.com/users/roundup.twistd/twisted/issue578

Ok, I filed an issue for Twisted at the address above. I hope it will be corrected as soon as possible.

--
Valentino Volonghi aka Dialtone
Linux User #310274, Gentoo Proud User
X Python Newsreader developer
http://sourceforge.net/projects/xpn/
participants (3)
- Andrew Bennetts
- Matteo Giacomazzi
- Valentino Volonghi aka Dialtone