[Baypiggies] web scraping best practice question
Thomas Belote
tbelote at tombelote.com
Mon Nov 2 20:27:23 CET 2009
When crawling, I would also check robots.txt for the Crawl-delay
directive. Otherwise I think your rate limiting is more than
sufficient; most search engines are more aggressive than what
you have below.
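
A rough sketch of that check, using the standard library's
urllib.robotparser. The function name pick_delay, its parameters, and
the fallback to a random 1-5 second pause are my own illustration, not
anything from Isaac's script. Crawl-delay is a non-standard directive,
so many sites will not set it.

```python
import random
from urllib.robotparser import RobotFileParser


def pick_delay(robots_lines, user_agent="*", low=1, high=5):
    """Return the number of seconds to wait between requests.

    Uses the site's Crawl-delay for this user agent when robots.txt
    sets one; otherwise falls back to a random whole number of
    seconds between low and high (hypothetical defaults).
    """
    parser = RobotFileParser()
    # parse() takes the lines of an already-fetched robots.txt.
    # Against a live site you could instead call
    # parser.set_url(".../robots.txt") followed by parser.read().
    parser.parse(robots_lines)
    delay = parser.crawl_delay(user_agent)  # None if not specified
    if delay is not None:
        return delay
    return random.randint(low, high)
```

You would call this once before the loop and sleep for the returned
value (or pick a fresh random value each time when it falls back).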
On Nov 2, 2009, at 11:22 AM, Isaac wrote:
> Hello Baypiggies.
>
> I wrote a Python script to send a query to a single website. I am
> curious: what is the best practice for the rate of sending requests
> when scraping a single site? I'll have about 4000 requests.
> I thought about _politely_ writing:
>
> import random
> from time import sleep
>
> for x in large_query_list:
>     send_scrap_query(x)
>     t = random.randint(1, 5)  # whole seconds, 1 to 5 inclusive
>     sleep(t)
>
> to pause for a pseudo-random duration between requests, so I don't
> put strain on anyone's network. Does anyone have recommendations for
> best practices regarding the rate of sending a set of queries? I missed
> the talk about web scraping from the beginning of the year.
>
> -Isaac