Your limit will usually be the number of file descriptors available to the process, which can usually be changed via ulimit or your system's equivalent.  On Linux I believe it defaults to 1024, so out of the box you should be able to handle roughly 1024 simultaneous connections.
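
For reference, here's a quick way to check that per-process limit from inside Python, and to raise the soft limit as far as the hard limit allows, using the standard resource module.  This is just a sketch, not anything from your app:

    import resource

    # current per-process file-descriptor limits
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("fd limit: soft=%d, hard=%d" % (soft, hard))

    # an unprivileged process may raise its soft limit up to the hard limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))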

One thing of note: you say you have the concurrency issues handled -- but with asynchronous I/O there are no concurrency issues, because there is no concurrency (at least not at the application level).  This is confusing at first, but it's important to understand.

All that said, you probably want to maintain a queue of URLs and some sort of graph representation of your data for the purpose of finding loops (e.g. A links to B, B links to C, C links to A).  You can then set an upper limit on the number of concurrent connections (say 1000) and track the number of deferreds in flight simply by noting when you start connections and when they finish (via callbacks).  Your initial seed can start one URL, its callback can hit all the linked pages, and so on.
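
Here's a minimal sketch of that scheme, assuming twisted.web.client.getPage as the fetcher.  The names (MAX_CONCURRENT, crawl, extract_links, the naive regex) are mine and not anything from your code; a DeferredSemaphore does the "at most N in flight" bookkeeping for you:

    import re

    from twisted.internet import defer, reactor
    from twisted.python import log
    from twisted.web.client import getPage

    MAX_CONCURRENT = 1000                  # upper bound on in-flight fetches
    sem = defer.DeferredSemaphore(MAX_CONCURRENT)
    seen = set()                           # URLs already fetched or in flight

    def crawl(url):
        if url in seen:                    # breaks cycles (A -> B -> C -> A)
            return None
        seen.add(url)
        d = sem.run(getPage, url)          # waits for a free slot, then fetches
        d.addCallback(handle_page, url)
        d.addErrback(log.err)
        return d

    def handle_page(body, url):
        for link in extract_links(body):
            crawl(link)

    def extract_links(body):
        # naive extraction; a real spider would parse the HTML and resolve
        # relative URLs
        return re.findall(r'href="(http[^"]+)"', body)

    if __name__ == "__main__":
        crawl("http://www.example.com/")   # hypothetical seed URL
        reactor.run()                      # nothing stops the reactor here

A plain "seen" set is enough to keep cycles from re-fetching pages; keep the full link graph separately if you also want to report the loops themselves.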

You might be hitting a cycle in the page-traversal graph, and that could be causing you all sorts of problems in terms of recursion depth or running out of file descriptors.  Without seeing your code or your target site, though, it's impossible to say.

Have you considered using another library for the web spidering?  I believe Scrapy (http://scrapy.org) is a good spidering tool, and it might be easier to use a decent existing library than to roll your own.
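
For a sense of scale, a basic Scrapy spider for this kind of crawl is only a few lines.  This is a rough sketch against Scrapy's Spider API in recent releases (the class name and seed URL are made up); Scrapy's built-in duplicate filter and concurrency settings handle the cycle and in-flight-limit problems for you:

    import scrapy

    class SiteSpider(scrapy.Spider):
        name = "site"
        start_urls = ["http://www.example.com/"]   # hypothetical seed

        def parse(self, response):
            # do whatever you need with response.url / response.body here,
            # then follow every link; duplicates are filtered automatically
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)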


  - Matt



On Tue, Oct 6, 2009 at 10:40 PM, Steve Steiner (listsin) <listsin@integrateddevcorp.com> wrote:
So, I have a situation...

       I have an application whose basic function is, in simplified form:

        def main():
            get_web_page(main_page_from_params)

        def get_web_page(page_name):
            set up a page-getter deferred; one of its callbacks pulls the
            links out of the page and sends them to get_them()

        def get_them(links):
            for l in links:
                if l is not already being gotten and hasn't been got:
                    deferred = get_web_page(l)

       In other words, I feed in the top level page, then recursively feed
in any pages linked to by the current page, and they feed in all their
links, until all pages are gotten.

        I understand the concurrency issues with multiple deferreds trying
to add pages to the "get list" -- that's properly handled in the code
(as far as I can tell, so far).

       So, here's the question...

        I have a "pages" list containing all of the pages.

       They are set to either gotten or in-flight.

       In-flight means I have a deferred that's going to go get it (in
get_web_page()).

       IOW, right now, if I don't already have the page, and I have a link
to it, I just start a deferred to go get it.

       Should I limit the number of "in-flight" pages?

        Currently, I'm scanning sites that have upwards of 5000 pages and
it seems that, when I get too many deferreds in flight, the app
*appears* to crash.

        I'm not sure whether it's actually gone out to lunch or just
appears that way.  Before I go instrumenting the app to death, can
anyone tell me whether there is some practical limit where too many
"in-flight" deferreds start to cause issues, just due to the sheer
number?

       Thanks for any insight on this that anyone might offer.

        I expect the usual flurry of "you must post your exact code or we
can't help you at all, moron" posts, but...

        In spite of my not having posted specific code, could someone with
some actual experience in this please give me a clue, within an order
of magnitude, as to how many deferreds might start to cause real
trouble?

Thanks,

S



