[Catalog-sig] Proposal: Move PyPI static data to the cloud for better availability

Ian Bicking ianb at colorstudy.com
Wed Jun 16 00:47:57 CEST 2010

Hmm... long thread.

Anyway: I'm +1 on using a CDN.  I think the overhead of managing a mirror
network is considerably greater than the cost of the CDN, and more
error-prone.  With a CDN one developer can figure out how to implement this
in PyPI, and any problems will be with PyPI, not some other mirror system
that the person debugging the problem doesn't control.

I think your cost only covers bandwidth, but there are also storage costs.
What disk space are the PyPI packages using right now?  That will only
increase over time as PyPI generally keeps all releases.  Possibly CDN space
could be donated.  As an implementation note, Google's new system copies
S3's API (http://code.google.com/apis/storage/) -- I'm not sure if it covers
the same territory as CloudFront though.  Anyway, implementing to
S3/CloudFront probably is a good bet even if the provider changes in the
future.

For generating /simple/ with a cronjob, I'm -0.  I find these delays make
testing difficult and unreliable; you can never be sure if the job is just
slow, what you did didn't work, etc.  I'd rather see PyPI shift to creating
static pages on-demand, that is, anytime they need updating.  Then if PyPI
goes down the static pages still exist and work, but there's no delay.
Another option might be a caching proxy configured to serve up cached copies
when the underlying system is down... but I'm not sure if that's any less
work ultimately, and is more ongoing administration.
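As a rough illustration of the on-demand idea, here is a minimal sketch of
rewriting one project's static page whenever its file list changes.  The
function names, the directory layout, and the page markup are all
assumptions made for the example, not how PyPI actually generates /simple/.
The page is written to a temporary file and then renamed into place, so a
reader never sees a half-written index.

```python
import html
import os
import tempfile


def render_simple_page(project, filenames):
    """Build a minimal index page for one project (hypothetical markup)."""
    links = "\n".join(
        '<a href="../../packages/{0}">{0}</a><br/>'.format(
            html.escape(name, quote=True)
        )
        for name in sorted(filenames)
    )
    return "<html><head><title>{}</title></head><body>\n{}\n</body></html>\n".format(
        html.escape(project), links
    )


def write_simple_page(root, project, filenames):
    """Rewrite <root>/simple/<project>/index.html atomically."""
    project_dir = os.path.join(root, "simple", project)
    os.makedirs(project_dir, exist_ok=True)
    # Write next to the target so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=project_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(render_simple_page(project, filenames))
    # mkstemp creates the file owner-only; make it readable by a web server.
    os.chmod(tmp_path, 0o644)
    target = os.path.join(project_dir, "index.html")
    os.replace(tmp_path, target)
    return target
```

The upload handler would call something like write_simple_page() right
after it records a new release file, so there is no window where the
static copy lags behind.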

I don't see a benefit to moving further into the cloud, such as hosting on
multiple machines.  I suspect that PyPI is not anywhere near needing more
power than a good sized server can provide, and I doubt that will change
soon.  It will be easier to manage the system with a single machine and
database.  There won't be network problems where app servers can't access
the database, for instance.  Or a need for replication, which is another big
potential administration hassle.

>  * scalability
>  * 24/7 system administration management
>  * geo-localized fast and reliable access
> Current Situation
> -----------------
> PyPI is currently run from a single server hosted in The Netherlands
> (ximinez.python.org).  This server is run by a very small team of sys
> admin.

As far as I know, none of this changes how much administration load there
is, does it?  That is, cloud machines still need to be administered.  The
only way I see that you'd really decrease administration load is with a more
radical move to a managed service, like App Engine.  That's probably quite
doable and would have substantial advantages, but it feels like a quite
different approach than is proposed here and it involves lots more coding.

Unless there really is a problem with the physical management of the server?

> Server side: upload cronjobs
> ----------------------------
> Since the /simple index tree is currently being created dynamically,
> we'd need to create static copies of it at regular intervals in order
> to upload the content to the S3 bucket. This can easily be done using
> tools such as wget or curl.
> Both the static copy of the /simple tree and the static files uploaded
> to /packages then need to be uploaded or updated in the S3 bucket by a
> cronjob running every 10-20 minutes.

Is it easy to sync something with S3?  It's easy to upload, delete, etc.,
but sync is rather different, no?  Not a big deal, just that changes would
have to be tracked if sync was not efficient.
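To make the "track the changes" point concrete, here is a small sketch of
planning a sync from two checksum manifests.  It does not talk to S3 at
all: the `remote` dict (key -> checksum) is a hypothetical input, and how
it would be obtained (a bucket listing, a locally stored record of the last
upload, etc.) is left open.  The helper names are made up for the example.

```python
import hashlib
import os


def local_manifest(root):
    """Map each file under root (as a '/'-separated key) to its MD5 hex digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            key = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, "rb") as handle:
                manifest[key] = hashlib.md5(handle.read()).hexdigest()
    return manifest


def plan_sync(local, remote):
    """Return (uploads, deletes) needed to make remote match local.

    Both arguments are dicts of key -> checksum.  A key is uploaded when it
    is missing remotely or its checksum differs; it is deleted when it no
    longer exists locally.
    """
    uploads = sorted(key for key, digest in local.items() if remote.get(key) != digest)
    deletes = sorted(key for key in remote if key not in local)
    return uploads, deletes
```

With this shape the cronjob only touches the files that actually changed,
at the cost of having to keep or fetch the remote manifest on every run.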

> Server side: redirection setup
> ------------------------------
> Since PyPI wasn't designed to be put on a CDN, it mixes static file
> URL paths with dynamic access ones, e.g.
> dynamic:
>  http://pypi.python.org/pypi
>  (and a few others)
> static:
>  http://pypi.python.org/simple
>  http://pypi.python.org/packages
> To move part of the URL path tree to a CDN, which works based on
> domains, we will need to provide a URL redirection setup that
> redirects client side tools to the new location.

As far as I know /packages isn't accessed directly, but only from links from
/simple -- so if those links are updated everything should work.  Some
packages already aren't on PyPI, so there's no particular expectation about
hosting location.
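A sketch of what "updating those links" could look like when the /simple
pages are generated: links that resolve to /packages/ on the index's own
host are pointed at the CDN, and everything else (including externally
hosted packages) is left alone.  The CDN host name and the function name
are placeholders invented for the example.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

CDN_HOST = "cdn.example.org"  # hypothetical CDN domain


def to_cdn_url(page_url, href, cdn_host=CDN_HOST):
    """Resolve href against page_url and move /packages/ links to the CDN."""
    absolute = urljoin(page_url, href)
    parts = urlsplit(absolute)
    same_host = parts.netloc == urlsplit(page_url).netloc
    if not (same_host and parts.path.startswith("/packages/")):
        # External or non-package link: return it resolved but otherwise unchanged.
        return absolute
    # Keep scheme, path, query and fragment (e.g. "#md5=..."); swap only the host.
    return urlunsplit((parts.scheme, cdn_host, parts.path, parts.query, parts.fragment))
```

For example, a relative link on http://pypi.python.org/simple/demo/ to
../../packages/source/d/demo/demo-1.0.tar.gz would come out pointing at
the same path on the CDN host, with any checksum fragment preserved.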

If /simple/ is a set of static files hosted on ximinez, will it be reliable
enough?  Then no redirects will be required.  I don't know what exactly has
caused failures.  If it's networking then redirects would help.  If it's
services failing, then static files will solve it.  If it's the entire
machine getting wonky, e.g., if memory is exhausted... then quite possibly
static files will help avoid those situations but it's not a guarantee.

Ian Bicking  |  http://blog.ianbicking.org