Rate limiting a web crawler
Terry Reedy
tjreedy at udel.edu
Wed Dec 26 14:04:14 EST 2018
On 12/26/2018 10:35 AM, Simon Connah wrote:
> Hi,
>
> I want to build a simple web crawler. I know how I am going to do it but
> I have one problem.
>
> Obviously I don't want to negatively impact any of the websites that I
> am crawling so I want to implement some form of rate limiting of HTTP
> requests to specific domain names.
>
> What I'd like is some kind of timer that calls a piece of code, say
> every 5 seconds, and that code then goes off and crawls the website.
>
> I'm just not sure on the best way to call code based on a timer.
>
> Could anyone offer some advice on the best way to do this? It will be
> running on Linux and using the python-daemon library to run it as a
> service and will be using at least Python 3.6.
You can use asyncio to make repeated non-blocking requests to a web site
at timed intervals, and to work with multiple websites at once. You could
do the same with tkinter, except that each request would block until a
response arrives unless you implemented your own polling.
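For instance, here is a minimal sketch of that asyncio approach (my own
illustration, not something from the original thread): one task per
domain, each pausing a fixed number of seconds between requests, with the
blocking urllib fetch pushed into a worker thread so the event loop stays
responsive. The domain list and the 5-second delay are placeholders, and
the get_event_loop()/run_until_complete style is used so it also runs on
Python 3.6.

import asyncio
import urllib.request

DELAY_SECONDS = 5          # minimum pause between requests to one domain
DOMAINS = {                # hypothetical seed URLs, one list per domain
    "example.com": ["https://example.com/"],
    "example.org": ["https://example.org/"],
}

def fetch(url):
    # Blocking fetch; run in a worker thread so the event loop stays free.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

async def crawl_domain(domain, urls):
    loop = asyncio.get_event_loop()
    for url in urls:
        body = await loop.run_in_executor(None, fetch, url)
        print(domain, "fetched", url, "-", len(body), "bytes")
        await asyncio.sleep(DELAY_SECONDS)   # per-domain rate limit

async def main():
    # Crawl all domains concurrently; each is rate limited independently.
    await asyncio.gather(*(crawl_domain(d, u) for d, u in DOMAINS.items()))

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

A real crawler would also want to honor robots.txt and handle network
errors, but the per-domain timing structure would stay the same.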
--
Terry Jan Reedy