
On Wed, Mar 31, 2004 at 01:27:49PM +0200, Valentino Volonghi aka Dialtone wrote:
Andrew Bennetts wrote:
On Wed, Mar 31, 2004 at 09:33:58AM +0200, Valentino Volonghi aka Dialtone wrote:
Hi all, attached you will find my rss-aggregator made with twisted.
It's really fast, although when I tried it with 745 feeds I ran into a problem. When the download reached 300 parsed feeds (more or less) it locked up until I pressed Ctrl+C, and then it processed the remaining 340 feeds in less than 30 seconds... I think my design has at least one issue, but I cannot find it easily, and I hope someone on this list can help me improve it.
By default, Twisted uses the platform name resolver, which is blocking. Perhaps a non-existent domain is causing gethostbyname to block?
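To see why a blocking lookup can freeze the whole aggregator, here is a stdlib-only sketch (not Twisted code; slow_lookup is a hypothetical stand-in for a gethostbyname call against an unresponsive nameserver). In a single-threaded reactor the call would stall the event loop; handing it to a worker thread, roughly what Twisted's deferToThread does, keeps the loop turning:

```python
import threading
import time

def slow_lookup(host):
    # Stand-in for a blocking gethostbyname() stuck on a dead
    # nameserver; the sleep simulates the stall.
    time.sleep(0.2)
    return "127.0.0.1"

results = {}
worker = threading.Thread(
    target=lambda: results.setdefault("addr", slow_lookup("example.invalid"))
)
worker.start()

# The "event loop" keeps ticking while the lookup is in flight;
# had slow_lookup run on this thread, no ticks would happen at all.
loop_turns = 0
while worker.is_alive():
    loop_turns += 1
    time.sleep(0.01)
worker.join()
print(results["addr"], loop_turns)
```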
Uhmm... dunno, but I tried to remove the 'locking' feed-source and it didn't change.
Hmm, it's unlikely to be DNS lookups causing it, then. We need some way to narrow down where it's happening. There are a few options I can think of, but they're all a bit heavyweight:

- Use strace to get some idea what it's doing
- Use the --spew option of twistd (or manually install the spewer with
  "from twisted.python.util import spewer; sys.settrace(spewer)")
- Use gdb to attach to the process, then look at the backtrace there.
  (You can apparently get the Python backtrace in gdb by putting this
  macro in your .gdbinit:

      define ppystack
        while $pc < Py_Main || $pc > Py_GetArgcArgv
          if $pc > eval_frame && $pc < PyEval_EvalCodeEx
            set $__fn = PyString_AsString(co->co_filename)
            set $__n = PyString_AsString(co->co_name)
            printf "%s (%d): %s\n", $__fn, f->f_lineno, $__n
          end
          up-silently 1
        end
        select-frame 0
      end

  But I've never tried this...)

Is it possible that feedparser is hanging while trying to parse that feed? Perhaps try putting print statements before and after the feedparser.parse call.
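The print-statement suggestion can be packaged as a small wrapper so a hang shows up as a "start" line with no matching "done" line. A sketch (timed_call is a hypothetical helper; feedparser.parse would be passed in as the callable):

```python
import time

def timed_call(label, fn, *args, **kwargs):
    # Print before and after the call, with elapsed time; if the
    # process hangs inside fn, the last line printed tells you which
    # feed it was working on.
    print("start %s" % label)
    start = time.time()
    result = fn(*args, **kwargs)
    print("done %s in %.2fs" % (label, time.time() - start))
    return result

# Usage in the aggregator (feed_url and raw_xml are whatever it
# already has in hand):
#     parsed = timed_call(feed_url, feedparser.parse, raw_xml)
```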
You should be able to test this theory by installing Twisted's resolver:
    from twisted.names import client
    reactor.installResolver(client.createResolver())
client.createResolver makes a reasonable effort to use your system's DNS configuration (by looking at /etc/resolv.conf on POSIX systems, for example), so it should work without any special arguments.
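For the curious, the POSIX side of that discovery boils down to pulling the "nameserver" lines out of resolv.conf. A minimal illustrative sketch of that kind of parsing (not Twisted's actual implementation; parse_nameservers is a hypothetical helper):

```python
def parse_nameservers(resolv_conf_text):
    # Collect the address from each "nameserver <addr>" line,
    # skipping comments and unrelated directives such as "search"
    # or "domain".
    servers = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers

sample = """# /etc/resolv.conf
search example.com
nameserver 192.168.0.1
nameserver 10.0.0.53
"""
print(parse_nameservers(sample))  # ['192.168.0.1', '10.0.0.53']
```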
ok, it changes into a totally non-working script :)
I get a lot of these:

    [Failure instance: Traceback: exceptions.TypeError, unsubscriptable object
    /usr/lib/python2.3/site-packages/twisted/internet/defer.py:313:_runCallbacks
    /usr/lib/python2.3/site-packages/twisted/names/resolve.py:44:__call__
    /usr/lib/python2.3/site-packages/twisted/names/common.py:36:query
    /usr/lib/python2.3/site-packages/twisted/names/common.py:104:lookupAllRecords
    /usr/lib/python2.3/site-packages/twisted/names/client.py:266:_lookup
    /usr/lib/python2.3/site-packages/twisted/names/client.py:214:queryUDP
    ]
Ouch. I wonder how that bug crept in? The twisted.names code is expecting a sequence of timeouts (to re-issue the query with, until finally giving up), but twisted.internet is only giving it a single integer. I've filed a bug report for this: http://twistedmatrix.com/bugs/issue570, if you care :)
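The mismatch is easy to picture: the resolver subscripts its timeout argument once per retry, so a bare integer raises exactly the "unsubscriptable object" TypeError in the traceback above. A hypothetical coercion helper showing the shape of a fix (names and the default retry schedule are assumptions, not Twisted's actual code):

```python
def coerce_timeouts(timeout):
    # twisted.names wants a sequence of per-retry timeouts; callers
    # sometimes hand in a single number instead.  Wrap scalars in a
    # one-element tuple so subscripting always works.
    if timeout is None:
        return (10,)
    if isinstance(timeout, (int, float)):
        return (timeout,)
    return tuple(timeout)

print(coerce_timeouts(30))          # (30,)
print(coerce_timeouts([1, 3, 11]))  # (1, 3, 11)
```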
BTW, when it finishes (with all 740 feeds) it reports an awesome 330 seconds, which is an impressive time, less than half a second for each feed, and it downloads more than 50Mb of feeds from the net (with 745 feeds to download).
Nice!
Yup, I was going to ask the Straw developers to use my script instead of asyncore. Straw has a lot of problems with 200 feeds, e.g. it resets the connection and such. This would be an awesome improvement.
Absolutely. I've heard similar complaints about straw, and I've been hoping some keen person would apply Twisted to fix the problem :) -Andrew.