
On Wed, Mar 31, 2004 at 01:27:49PM +0200, Valentino Volonghi aka Dialtone wrote:
Andrew Bennetts wrote:
On Wed, Mar 31, 2004 at 09:33:58AM +0200, Valentino Volonghi aka Dialtone wrote:
Hi all, attached you will find my rss-aggregator made with twisted.
It's really fast, although when I tried it with 745 feeds I ran into a problem. When the download reached 300 parsed feeds (more or less) it locked up until I pressed Ctrl+C, and then it processed the remaining 340 feeds in less than 30 seconds... I think my design has at least one issue, but I cannot find it easily, and I hope someone on this list can help me improve it.
By default, Twisted uses the platform name resolver, which is blocking. Perhaps a non-existent domain is causing gethostbyname to block?
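To see why a blocking lookup can freeze the whole aggregator, here is a stdlib-only sketch (not Twisted code; slow_lookup is a hypothetical stand-in for a gethostbyname call against an unresponsive nameserver). In a single-threaded reactor the call would stall the event loop; handing it to a worker thread, roughly what Twisted's deferToThread does, keeps the loop turning:

```python
import threading
import time

def slow_lookup(host):
    # Stand-in for a blocking gethostbyname() stuck on a dead
    # nameserver; the sleep simulates the stall.
    time.sleep(0.2)
    return "127.0.0.1"

results = {}
worker = threading.Thread(
    target=lambda: results.setdefault("addr", slow_lookup("example.invalid"))
)
worker.start()

# The "event loop" keeps ticking while the lookup is in flight;
# had slow_lookup run on this thread, no ticks would happen at all.
loop_turns = 0
while worker.is_alive():
    loop_turns += 1
    time.sleep(0.01)
worker.join()
print(results["addr"], loop_turns)
```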
Uhmm... dunno, but I tried to remove the 'locking' feed-source and it didn't change.
Hmm, it's unlikely to be DNS lookups causing it, then. We need some way to narrow down where it's happening. There are a few options I can think of, but they're all a bit heavyweight:

- Use strace to get some idea what it's doing
- Use the --spew option of twistd (or manually install the spewer with
  "from twisted.python.util import spewer; sys.settrace(spewer)")
- Use gdb to attach to the process, then look at the backtrace there.
  (You can apparently get the Python backtrace in gdb by putting this
  macro in your .gdbinit:

      define ppystack
        while $pc < Py_Main || $pc > Py_GetArgcArgv
          if $pc > eval_frame && $pc < PyEval_EvalCodeEx
            set $__fn = PyString_AsString(co->co_filename)
            set $__n = PyString_AsString(co->co_name)
            printf "%s (%d): %s\n", $__fn, f->f_lineno, $__n
          end
          up-silently 1
        end
        select-frame 0
      end

  But I've never tried this...)

Is it possible that feedparser is hanging while trying to parse that feed? Perhaps try putting print statements before and after the feedparser.parse call.
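The print-statement suggestion can be packaged as a small wrapper so a hang shows up as a "start" line with no matching "done" line. A sketch (timed_call is a hypothetical helper; feedparser.parse would be passed in as the callable):

```python
import time

def timed_call(label, fn, *args, **kwargs):
    # Print before and after the call, with elapsed time; if the
    # process hangs inside fn, the last line printed tells you which
    # feed it was working on.
    print("start %s" % label)
    start = time.time()
    result = fn(*args, **kwargs)
    print("done %s in %.2fs" % (label, time.time() - start))
    return result

# Usage in the aggregator (feed_url and raw_xml are whatever it
# already has in hand):
#     parsed = timed_call(feed_url, feedparser.parse, raw_xml)
```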
You should be able to test this theory by installing Twisted's resolver:
    from twisted.names import client
    reactor.installResolver(client.createResolver())
client.createResolver makes a reasonable effort to use your system's DNS configuration (by looking at /etc/resolv.conf on POSIX systems, for example), so it should work without any special arguments.
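For the curious, the POSIX side of that discovery boils down to pulling the "nameserver" lines out of resolv.conf. A minimal illustrative sketch of that kind of parsing (not Twisted's actual implementation; parse_nameservers is a hypothetical helper):

```python
def parse_nameservers(resolv_conf_text):
    # Collect the address from each "nameserver <addr>" line,
    # skipping comments and unrelated directives such as "search"
    # or "domain".
    servers = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers

sample = """# /etc/resolv.conf
search example.com
nameserver 192.168.0.1
nameserver 10.0.0.53
"""
print(parse_nameservers(sample))  # ['192.168.0.1', '10.0.0.53']
```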
ok, it changes into a totally non-working script :)
I get a lot of these:

    [Failure instance: Traceback: exceptions.TypeError, unsubscriptable object
    /usr/lib/python2.3/site-packages/twisted/internet/defer.py:313:_runCallbacks
    /usr/lib/python2.3/site-packages/twisted/names/resolve.py:44:__call__
    /usr/lib/python2.3/site-packages/twisted/names/common.py:36:query
    /usr/lib/python2.3/site-packages/twisted/names/common.py:104:lookupAllRecords
    /usr/lib/python2.3/site-packages/twisted/names/client.py:266:_lookup
    /usr/lib/python2.3/site-packages/twisted/names/client.py:214:queryUDP
    ]
Ouch. I wonder how that bug crept in? The twisted.names code is expecting a sequence of timeouts (to re-issue the query with, until finally giving up), but twisted.internet is only giving it a single integer. I've filed a bug report for this: http://twistedmatrix.com/bugs/issue570, if you care :)
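The mismatch is easy to picture: the resolver subscripts its timeout argument once per retry, so a bare integer raises exactly the "unsubscriptable object" TypeError in the traceback above. A hypothetical coercion helper showing the shape of a fix (names and the default retry schedule are assumptions, not Twisted's actual code):

```python
def coerce_timeouts(timeout):
    # twisted.names wants a sequence of per-retry timeouts; callers
    # sometimes hand in a single number instead.  Wrap scalars in a
    # one-element tuple so subscripting always works.
    if timeout is None:
        return (10,)
    if isinstance(timeout, (int, float)):
        return (timeout,)
    return tuple(timeout)

print(coerce_timeouts(30))          # (30,)
print(coerce_timeouts([1, 3, 11]))  # (1, 3, 11)
```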
BTW, when it finishes (with all 740 feeds) it reports an awesome 330 seconds, which is an impressive time, less than half a second for each feed, and it downloads more than 50Mb of feeds from the net (with 745 feeds to download).
Nice!
Yup, I was going to ask the Straw developers to use my script instead of asyncore. Straw has a lot of problems with 200 feeds, e.g. it resets the connection and such. This would be an awesome improvement.
Absolutely. I've heard similar complaints about straw, and I've been hoping some keen person would apply Twisted to fix the problem :) -Andrew.