Limit the simultaneous twisted.web.client.downloadPage requests
Hello! I am a newbie in twisted, sorry if my question sounds awkward.

I have written a pretty simple recursive page downloader, which parses an html page, extracts all the needed links from it, and starts downloading them. The links are video files, so they are pretty large. The problem is that the downloader works TOO FAST :) I want to set something like a global bandwidth limit, or a maximum limit on the number of concurrently downloading files.

I am using twisted.web.client.downloadPage to download the files and using the Deferred that it returns. I can't understand how to make it still return a Deferred corresponding to that file, but not start downloading right away, instead starting the download on some kind of event (i.e. make a manager-like wrapper for that function).

So I want the code to still look simple, like this:

    for link in links:
        d = downloadPage_limited(link, filename)

And the wrapper (the function downloadPage_limited) will manage the number of concurrent downloads, while still returning the Deferred that twisted.web.client.downloadPage returns.

Is my idea about a "wrapper" practical, and what's the general way to write it? On which event is it better to decrement the counter of currently downloading files?

Hope it is clear enough. Thanks in advance,
Igor Katson.
On 01:20 pm, descentspb@gmail.com wrote:
Is my idea about a "wrapper" practical, and what's the general way to write it? On which event is it better to decrement the counter of currently downloading files?
Yes, that's a good idea. You might be able to use twisted.internet.defer.DeferredSemaphore to handle all of the counting for you. For example:

    from twisted.internet.defer import DeferredSemaphore
    from twisted.web.client import downloadPage

    class LimitedDownloader:
        def __init__(self, howMany):
            self._semaphore = DeferredSemaphore(howMany)

        def downloadPage(self, *a, **kw):
            # Inside the method body, the bare name downloadPage refers
            # to the imported twisted.web.client.downloadPage function,
            # not to this method.
            return self._semaphore.run(downloadPage, *a, **kw)

    downloader = LimitedDownloader(3)
    downloader.downloadPage(...)

In this example, DeferredSemaphore.run will only let 3 downloadPage calls run concurrently. If a 4th is attempted before any earlier ones finish, it won't actually be called until one of the earlier ones finishes.

Jean-Paul
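For readers who want to see the counting that DeferredSemaphore.run encapsulates, here is a stdlib-only sketch of the same idea. Everything in it (the SimpleLimiter name and its methods) is invented for illustration and is not a Twisted API: at most `tokens` jobs run at once, extras wait in a queue, and each finished job hands its slot to the next waiter.

```python
from collections import deque

class SimpleLimiter:
    """Hypothetical stdlib-only sketch of DeferredSemaphore-style
    counting: at most `tokens` jobs are started at once; extras
    queue up and start as earlier jobs call release()."""

    def __init__(self, tokens):
        self.tokens = tokens    # maximum concurrent jobs
        self.active = 0         # jobs currently running
        self.waiting = deque()  # jobs queued for a free slot

    def run(self, start_job):
        # start_job begins an asynchronous job; whatever marks the
        # job as finished must call release().
        if self.active < self.tokens:
            self.active += 1
            start_job()
        else:
            self.waiting.append(start_job)

    def release(self):
        # In Twisted terms, the natural trigger for this is the
        # download's Deferred firing, via addBoth, so that failed
        # downloads free their slot too.
        if self.waiting:
            self.waiting.popleft()()  # hand the slot straight over
        else:
            self.active -= 1

# Demo: queue 5 "downloads" through a limit of 3.
limiter = SimpleLimiter(3)
started = []
for i in range(5):
    limiter.run(lambda i=i: started.append(i))
# Only the first 3 start immediately; each release() starts one more.
limiter.release()
```

The sketch also answers the original question about where to decrement the counter: decrement when the download's Deferred fires, on success or failure alike, which is exactly when the semaphore's slot should be handed to the next queued download.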
exarkun@twistedmatrix.com wrote:
Thanks for the quick and great help, Terry and Jean-Paul!
On Oct 24, 2009, at 6:20 AM, Igor Katson wrote:
Is my idea about a "wrapper" practical and what's the general way to write it? On which event is it better to decrement the counter of the amount currently downloading files?
Another way that you might do this is by using this small snippet:

http://bitbucket.org/adroll/turtl/src/tip/turtl/engine.py

turtl is a project that you can use either as a proxy server in front of all of your clients (throttling the requests on a per-URL basis) or embedded in your system like this:

    from turtl import engine

    thr = engine.ThrottlingDeferred(parallelism, calls, interval)
    dl = [thr.run(callable, *args, **kwargs) for args in self.args]
    defer.DeferredList(dl).addBoth(lambda _: reactor.stop())

In this case you get a lot more control over the number of calls you are allowed to make in a given interval (for example, Amazon Alexa allows only 15 calls per second).

--
Valentino Volonghi aka Dialtone
Now Running MacOSX 10.6
http://www.adroll.com/
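The calls-per-interval idea can also be sketched with only the standard library. IntervalLimiter below is an invented name, not turtl's API, and for simplicity it blocks with time.sleep; a real Twisted version would schedule the delayed call with reactor.callLater instead of sleeping.

```python
import time
from collections import deque

class IntervalLimiter:
    """Hypothetical stdlib sketch of per-interval throttling:
    allow at most `calls` invocations per `interval` seconds."""

    def __init__(self, calls, interval):
        self.calls = calls          # budget per window
        self.interval = interval    # window length in seconds
        self.timestamps = deque()   # start times of recent calls

    def wait_time(self, now):
        """Seconds to wait before the next call is allowed at `now`."""
        # Drop timestamps that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.interval:
            self.timestamps.popleft()
        if len(self.timestamps) < self.calls:
            return 0.0
        # Budget spent: wait until the oldest call leaves the window.
        return self.interval - (now - self.timestamps[0])

    def run(self, func, *args, **kwargs):
        delay = self.wait_time(time.monotonic())
        if delay > 0:
            time.sleep(delay)  # Twisted would use reactor.callLater here
        self.timestamps.append(time.monotonic())
        return func(*args, **kwargs)
```

With calls=15 and interval=1.0 this matches the Amazon Alexa example from the message above: a 16th call inside the same second is delayed until the oldest of the 15 falls out of the one-second window.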
participants (3)

- exarkun@twistedmatrix.com
- Igor Katson
- Valentino Volonghi