[Catalog-sig] Proposal: Move PyPI static data to the cloud for better availability

Tue Jun 15 16:33:45 CEST 2010

On Tue, 15 Jun 2010 09:49:03 pm M.-A. Lemburg wrote:
> As mentioned, I've been working on a proposal text for the cloud
> idea. Here's a first draft. Please have a look and let me know
> whether I've missed any important facts. Thanks.

I think the most important missing fact is just how unreliable PyPI
currently is. Does anyone know?

I know there are a number of people complaining that it's down "all the
time", or even occasionally, but I think that we need to know the
magnitude of the problem that needs solving. What's the average length
of time between outages? What's the average length of an outage? Just
saying that there have been several outages in recent months is awfully
hand-wavy.
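
Those two averages are enough to turn "several outages" into a single
availability figure. A minimal sketch, assuming the standard
steady-state formula uptime / (uptime + downtime); the function name and
the numbers are hypothetical illustrations, not measurements of PyPI:

```python
def availability(mean_uptime_hours, mean_outage_hours):
    """Long-run fraction of time the service is up.

    mean_uptime_hours: average time the service runs between outages.
    mean_outage_hours: average length of a single outage.
    """
    return mean_uptime_hours / (mean_uptime_hours + mean_outage_hours)


# Hypothetical figures: one two-hour outage roughly every 30 days.
fraction_up = availability(mean_uptime_hours=30 * 24, mean_outage_hours=2)
print(f"availability: {fraction_up:.4%}")
```

Plugging in real outage logs, if anyone has them, would give a number
that can be compared directly with the uptime figures quoted for S3.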

[...]
> Amazon Cloudfront uses S3 as basis for the service, S3 has been
> around for years and has a very stable uptime:
>
> http://www.readwriteweb.com/archives/amazon_s3_exceeds_9999_percent_uptime.php

Is there anyone here who has personal experience with Cloudfront and is 
willing to vouch for it? Or argue against it? We can only go so far 
based on Amazon's marketing material.

One thing that does worry me:

> So in summary we are replacing a single point of failure with N
> points of failure (with N being the number of edge caching servers
> they use).

I don't think this means what you seem to think it means. If you replace
a single point of failure with N points of failure, your overall
reliability goes down, not up, since there are now more things to go
wrong. Assuming that they're independent points of failure, that means
the number of individual failures you should expect will increase by a
factor of N.

For example, if a single edge server in (say) Australia goes down, 
Amazon might not count it as an outage for the purpose of calculating 
their 99.99% reliability since the system as a whole is still up, but 
conceivably Australian users might see an outage (or at least a 
slow-down). With N servers, I'd expect N times the number of individual 
outages, with Amazon presumably only counting it as "system down" if 
all N servers go down at the same time.
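
The gap between those two ways of counting can be made concrete. A
minimal sketch, assuming each edge server is independently down with the
same probability at any given moment; the function name and the 1%
figure are hypothetical, not Amazon's numbers:

```python
def outage_figures(p_down, n_servers):
    """Return three figures for n_servers independent servers.

    p_down: probability that one given server is down at a given moment.
    Returns (expected number of servers down,
             probability that at least one server is down,
             probability that all servers are down).
    """
    expected_down = n_servers * p_down
    p_any_down = 1 - (1 - p_down) ** n_servers
    p_all_down = p_down ** n_servers
    return expected_down, p_any_down, p_all_down


# Hypothetical: each server down 1% of the time, 1 server versus 10.
for n in (1, 10):
    expected, any_down, all_down = outage_figures(0.01, n)
    print(n, expected, any_down, all_down)
```

Under these assumptions the expected number of servers down grows
linearly with N, while the chance of every server being down at once
shrinks very quickly, so a "whole system" uptime figure and what the
users of one particular edge server experience can differ a great deal.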

-- 
Steven D'Aprano