[Catalog-sig] PyPI and Wiki crawling, and a CDN
ben at groovie.org
Sun Aug 12 21:07:53 CEST 2007
On Aug 7, 2007, at 2:06 PM, Martin v. Löwis wrote:
> I hope I have now solved the overload problem that massive
> crawling caused on the wiki and that, in consequence,
> caused the PyPI outage.
> Following Laura's advice, I added Crawl-delay into robots.txt.
> Several robots have picked that up, not just msnbot and slurp,
> but also e.g. MJ12bot.
> For the others, I had to fine-tune my throttling code, after
> observing that the expensive URLs are those with a query string.
> They now account for 3 regular queries (might have to bump this
> to 5), so you can only do one of them every 6s.
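The weighting described in the quoted text can be sketched roughly as follows. This is only an illustration, not the throttling code actually running on the server: the class name `CostThrottle`, the per-client dictionary, and the 2-second base interval are assumptions (the interval is inferred from "3 regular queries" adding up to 6s). A URL with a query string is charged 3 units, any other URL 1 unit, and a client has to wait out the cost of its previous request before the next one is accepted.

```python
import time


class CostThrottle:
    """Hypothetical per-client throttle with weighted request costs."""

    def __init__(self, seconds_per_unit=2.0, query_cost=3, clock=time.monotonic):
        # seconds_per_unit=2.0 is an assumption inferred from 3 units == 6s.
        self.seconds_per_unit = seconds_per_unit
        self.query_cost = query_cost
        self.clock = clock
        self.next_allowed = {}  # client id -> earliest time of next request

    def cost(self, url):
        # URLs with a query string are the expensive ones.
        return self.query_cost if "?" in url else 1

    def allow(self, client, url):
        """Return True and charge the client, or False if it must wait."""
        now = self.clock()
        if now < self.next_allowed.get(client, 0.0):
            return False
        self.next_allowed[client] = now + self.cost(url) * self.seconds_per_unit
        return True
```

Bumping the weight to 5, as considered in the quoted text, would correspond to `CostThrottle(query_cost=5)`, which under the same assumed base interval allows one query-string request every 10s.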
I don't suppose there are enough resources to just have PyPI on a
separate box entirely, so that whatever else is running (the wiki,
etc.) won't have the opportunity to drag down the package repository?
On a side note, has anyone looked into a CDN for packages to speed
up their delivery and take more of the traffic load off the PyPI
host? That would also lower the bar for other sites that want to
mirror PyPI, since they wouldn't have to host all the actual eggs as