[Twisted-Python] clientfactory cleanup slow-down (after many http requests)
Hello,
I've been working on a small Twisted program.
The program makes HTTP requests to a large number of feeds. Twisted is used to speed up the entire process. After the feeds are fetched, they're parsed. Finally they should be written to a database (to simplify the code, that part is left out).
Feeds are fetched in parallel using gatherResults, and a batch is built. Then all batches are again gathered into a set of batches, a DeferredList is built out of those. A semaphore controls both the batch-level list of deferreds, and a semaphore controls the entire batch list deferred.
Currently, the program works ok on 100-150 feeds, and BATCH_SIZE between 5 and 20.
However, I notice the program starts to hang for a long time, when the number of feeds goes over 150-200.
To be more precise, at the end of running the program, messages like these are printed, but the program seems to not be very active:
Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f0b7d5f3908>
It seems like this is the cleanup phase.
I've read what I could find on the topic. I wasn't able to make progress on it, so I'm posting to the mailing list to ask if someone has encountered this before. Maybe it's a common pitfall or issue that other people have also bumped into.
Thanks
On Aug 6, 2016, at 03:48, Randomcoder <randomcoder1@gmail.com> wrote:
Hello,
I've been working on a small Twisted program.
Cool, thanks for using Twisted.
The program makes HTTP requests to a large number of feeds. Twisted is used to speed up the entire process. After the feeds are fetched, they're parsed. Finally they should be written to a database (to simplify the code, that part is left out).
Thanks for including examples, so we know exactly what you're talking about! :)
Feeds are fetched in parallel using gatherResults, and a batch is built. Then all batches are again gathered into a set of batches, a DeferredList is built out of those. A semaphore controls both the batch-level list of deferreds, and a semaphore controls the entire batch list deferred.
Currently, the program works ok on 100-150 feeds, and BATCH_SIZE between 5 and 20.
This all seems pretty reasonable and following best practices and such...
However, I notice the program starts to hang for a long time, when the number of feeds goes over 150-200.
Two key questions: what do you mean by "hang" and what is "a long time"? Do you mean it's totally unresponsive, or do you mean it's just failing to make progress on downloading more feeds?
To be more precise, at the end of running the program, messages like these are printed, but the program seems to not be very active:
Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f0b7d5f3908>
It seems like this is the cleanup phase.
This just means that it is finished making connections. We have to do some clean-up around the usefulness of these log messages, sorry :-\.
I've read what I could find on the topic. I wasn't able to make progress on it, so I'm posting to the mailing list to ask if someone has encountered this before. Maybe it's a common pitfall or issue that other people have also bumped into.
Right now, my guess is this: some of the sites you're contacting have very slow proxies, or for some other reason let you connect to them, but then hang when sent requests. If you're simultaneously requesting stuff from a very large number of different sites, this is sort of inevitably bound to happen, either based on network problems, or issues with the sites themselves. I suspect you thought that the connectTimeout argument to Agent would save you from this, but that timeout is just for making the initial underlying TCP connection, not receiving a full response. What you actually want to do is cancel the Deferred returned by Agent.request.
Luckily, https://treq.readthedocs.io/en/latest/ already implements this high-level timeout functionality for you, in the form of the 'timeout=' argument it accepts. If you give that a try, do you see more connections timing out as it runs, rather than "hanging" the process for long periods of time?
As long as I'm looking at your code, as a way of thanking you for providing such a nice specific runnable example, I have a few other random thoughts which may be useful to you:
- I see you're importing psycopg. Do you know about https://txpostgres.readthedocs.io/en/latest/ ? You can talk to postgres asynchronously with Twisted.
- d.addCallback(lambda out: out).addCallback(lambda resp: client.readBody(resp)) can be much more briefly spelled "d.addCallback(client.readBody)". d.addErrback(lambda err: err) does nothing and can just be removed.
- BrowserLikePolicyForHTTPS() is the default, so you don't need to pass that.
- clean_up_and_exit will only be called if batchesDef doesn't fail, and if it does fail, it will swallow the exception message. Rather than manually calling `reactor.stop`, you probably want to use react(), <https://twistedmatrix.com/documents/16.3.0/api/twisted.internet.task.html#react>. This way your function is an API that anyone who wants to use it can call - it just returns a Deferred when it's done - but your __main__ block calls react() which will both start and stop the reactor, as well as reporting errors if there's a problem while still shutting down.
Hope some of that code review is helpful - let us know if the treq timeout solves the problem or if the issue is somewhere else!
-glyph
Wow! This is the friendliest way to welcome a new Twisted programmer. Great job Glyph! :)
Regards,
Manish
On Sat, Aug 6, 2016 at 3:51 PM, Glyph Lefkowitz <glyph@twistedmatrix.com> wrote:
_______________________________________________ Twisted-Python mailing list Twisted-Python@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
participants (3)
- Glyph Lefkowitz
- Manish Tomar
- Randomcoder