[Catalog-sig] PyPI and Wiki crawling, and a CDN

Ben Bangert ben at groovie.org
Sun Aug 12 21:07:53 CEST 2007


On Aug 7, 2007, at 2:06 PM, Martin v. Löwis wrote:

> I hope I have now solved the overload problem that massive
> crawling has caused to the wiki, and, in consequence,
> caused PyPI outage.
>
> Following Laura's advice, I added Crawl-delay into robots.txt.
> Several robots have picked that up, not just msnbot and slurp,
> but also e.g. MJ12bot.
>
> For the others, I had to fine-tune my throttling code, after
> observing that the expensive URLs are those with a query string.
> They now account for 3 regular queries (might have to bump this
> to 5), so you can only do one of them every 6s.
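
To make the arithmetic in the quoted paragraph concrete: if a URL with a query string counts as 3 regular queries and can only be fetched once every 6s, a regular query presumably costs 2s. Below is a rough sketch of that kind of weighted throttling. It is not Martin's actual code; the Throttle class, its method names, and the per-client bookkeeping are assumptions made up for illustration.

```python
from urllib.parse import urlsplit

# Inferred from the quoted text: 6s per query-string URL / weight 3 = 2s.
BASE_INTERVAL = 2.0   # seconds charged for one regular request (assumption)
QUERY_WEIGHT = 3      # a URL with a query string counts as 3 regular requests


class Throttle:
    """Hypothetical per-client throttle that charges expensive URLs more."""

    def __init__(self, base_interval=BASE_INTERVAL, query_weight=QUERY_WEIGHT):
        self.base_interval = base_interval
        self.query_weight = query_weight
        # client id -> earliest time at which the next request is accepted
        self.next_allowed = {}

    def cost(self, url):
        # URLs with a query string are the expensive ones.
        return self.query_weight if urlsplit(url).query else 1

    def allow(self, client, url, now):
        # Reject the request if the client is still paying for an earlier one.
        if now < self.next_allowed.get(client, 0.0):
            return False
        self.next_allowed[client] = now + self.cost(url) * self.base_interval
        return True
```

Bumping the weight to 5, as suggested above, would mean passing query_weight=5, which stretches the gap between query-string requests to 10s under the same assumptions.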

I don't suppose there are enough resources to put PyPI on a
separate box entirely, so that whatever else is running (the wiki,
etc.) can't drag down the package repository?

On a side note, has anyone looked into a CDN for packages to speed
up their delivery and take more of the traffic load off the PyPI
host? That would also lower the bar for other sites that want to
mirror PyPI, since they wouldn't have to host all the actual eggs as
well.

Cheers,
Ben