[Baypiggies] web scraping best practice question

Thomas Belote tbelote at tombelote.com
Mon Nov 2 20:27:23 CET 2009


When crawling, I would also check robots.txt for the Crawl-delay
directive. Otherwise, I think your rate limiting is more than
sufficient; most search engines crawl more aggressively than what
you have below.
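
Something like this untested sketch would do it, using Python's
urllib.robotparser (newer versions expose a crawl_delay() method);
url_for() is a placeholder for however you build each request URL,
and large_query_list and send_scrap_query() are from your snippet:

import random
from time import sleep
from urllib import robotparser  # "robotparser" in Python 2

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")  # hypothetical target site
rp.read()

# crawl_delay() returns the Crawl-delay value for our user agent,
# or None if robots.txt does not set one.
delay = rp.crawl_delay("*")

for x in large_query_list:
    if not rp.can_fetch("*", url_for(x)):  # skip disallowed URLs
        continue
    send_scrap_query(x)
    sleep(delay if delay is not None else random.randint(1, 5))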


On Nov 2, 2009, at 11:22 AM, Isaac wrote:

> Hello Baypiggies.
>
> I wrote a Python script to send a query to a single website. I am
> curious: what is the best practice for the rate of sending requests
> when scraping a single site? I'll have about 4000 requests.
> I thought about _politely_ writing:
>
> import random
> from time import sleep
>
> for x in large_query_list:
>     send_scrap_query(x)
>     t = random.randint(1, 5)
>     sleep(t)
>
> to pause for a pseudo-random duration between each request, so I don't
> put strain on anyone's network. Does anyone have recommendations for
> best practices regarding the rate of sending a set of queries? I missed
> the talk about web scraping from the beginning of the year.
>
> -Isaac
> _______________________________________________
> Baypiggies mailing list
> Baypiggies at python.org
> To change your subscription options or unsubscribe:
> http://mail.python.org/mailman/listinfo/baypiggies


