[Catalog-sig] Proposal: Move PyPI static data to the cloud for better availability

Tue Jun 15 13:49:03 CEST 2010

As mentioned, I've been working on a proposal text for the cloud idea.
Here's a first draft. Please have a look and let me know whether I've
missed any important facts. Thanks.

I intend to post the proposal to the PSF board (of which I'm a member,
in case you didn't know) and to have it vote on the proposal in one
of the next board meetings.

"""
PSF-Proposal: 100
Title: Move PyPI static data to the cloud for better availability
Version: Draft 1
Last-Modified: 2010-06-15
Author: mal at lemburg.com (Marc-André Lemburg)
Discussions-To: catalog-sig at python.org
Status: Draft
Type: Informational
Created: 2010-06-14
Post-History:

Proposal: Move PyPI static data to the cloud for better availability
========================================================================

Motivation
----------

PyPI has in recent months seen several outages, with the index being
unavailable both to users of the web GUI and to package
administration tools such as easy_install from setuptools.

As more and more Python applications rely on tools such as
easy_install for direct installation, or zc.buildout to manage the
complete software configuration cycle, the PyPI infrastructure
receives more and more attention from the Python community.

In order to maintain its credibility as a software repository, to
support the many different projects relying on the PyPI infrastructure
and the many users who rely on the simplified installation process
enabled by PyPI, the PSF needs to take action and move the essential
parts of PyPI to a more robust infrastructure that provides:

 * scalability
 * 24/7 system administration management
 * geo-localized fast and reliable access

Current Situation
-----------------

PyPI is currently run from a single server hosted in The Netherlands
(ximinez.python.org).  This server is run by a very small team of sys
admins.

PyPI itself has in recent months been mostly maintained by one
developer: Martin von Loewis.  Projects are underway to enhance PyPI
in various ways, including a proposal to add external mirroring (PEP
381), but these are all far from being finalized or implemented.

Usage
-----

PyPI provides four different mechanisms for accessing the stored
information:

 * a web GUI that is meant for use by humans
 * an RPC interface which is mostly used for uploading new
   content
 * a semi-static /simple package listing, used by setuptools
 * a static area /packages for package download files and
   documentation, used by both the web GUI and setuptools

The /simple package listing is a dump of all packages in PyPI in the
form of a simple HTML page with links to sub-pages for each package.
These sub-pages provide links to download files and external
references.

External tools like easy_install only use the /simple package
listing together with the hosted package download files.

While the /simple package listing is currently dynamically created
from the database in real-time, this is not really needed for normal
operation. A static copy created every 10-20 minutes would provide the
same level of service in much the same way.

Moving static data to a CDN
---------------------------

Under the proposal the static information stored in PyPI
(meta-information as well as package download files and documentation)
is moved to a content delivery network (CDN).

For this purpose, the /simple package listing is replaced with a
static copy that is recreated every 10-20 minutes using a cronjob on
the PyPI server.

At the same intervals, another script will scan the package and
documentation files under /packages for updates and upload any changes
to the CDN for near-real-time availability.

By using a CDN the PSF will:

 * provide high availability of the static PyPI content
 * offload management to the CDN
 * enable geo-localized downloads, i.e. the files are hosted
   on a nearby server
 * provide faster downloads
 * gain more reliability and scalability
 * move away from a single-point-of-failure setup

Note that the proposal does not cover distribution of the dynamic
parts of PyPI. As a result uploads to PyPI may still fail if the PyPI
server goes down. However, these dynamic parts are currently not being
used by the existing package installation tools.

Choice of CDN: Amazon Cloudfront
--------------------------------

To keep the costs low for the PSF, Amazon Cloudfront appears to be
the best choice for CDN.

Cloudfront is supported by a set of Python libraries (e.g. Amazon S3
lib and boto). Upload scripts are readily available and can easily be
customized.

 http://www.saltycrane.com/blog/2008/12/card-store-project-4-notes-using-amazons-cloudfront/

Other CDNs, such as Akamai, are either more expensive or require
custom integration.  Python-based tools are not always available for
them. In fact, even finding out whether such tools exist is difficult
for most of the proprietary CDNs.

Cloudfront: quality of service
------------------------------

Amazon Cloudfront uses S3 as the basis for the service. S3 has been
around for years and has a very stable uptime:

 http://www.readwriteweb.com/archives/amazon_s3_exceeds_9999_percent_uptime.php

Cloudfront itself has been around since Nov 2008.

You can check their current online status using this panel:

 http://status.aws.amazon.com/

Apart from the gained availability and outsourced management, we'd
also get faster downloads in most parts of the world, due to the local
caching Cloudfront is applying. This caching can be used to further
increase the availability, since we can control the expiry time of
those local copies.

So in summary we are replacing a single point of failure with N points
of failure (with N being the number of edge caching servers they use).

How Cloudfront works
--------------------

Cloudfront uses Amazon's S3 storage system which is based on
"buckets".  These can store any number of files in a directory-like
structure. The only limit is 5GB per file, which is more than enough
for any PyPI package file.

Cloudfront provides a domain for each registered S3 bucket via a
"distribution" which is then made available through local cache
servers in various locations around the world. The management of which
server to use for an incoming request is transparently handled by
Amazon. Once uploaded to the S3 bucket, the files will be distributed
to the cache servers on demand and as necessary.

Each edge server maintains a cache of requested files and
refetches the files after an expiry time which can be defined when
uploading the file to the bucket.

To simplify things on our side, we'll set up a CNAME DNS alias
for the Cloudfront domain issued by Amazon to our bucket:

 pypi-static.python.org. IN CNAME d32z1yuk7jeryy.cloudfront.net.

For more details, please see the Cloudfront documentation:

 http://aws.amazon.com/documentation/cloudfront/

Integration
-----------

In order to keep the number of changes to existing client side tools
and PyPI itself to a minimum, the installation will try to be as
transparent to both the server and the client side as possible.

This requires on the server side:

 * few, if any changes to the PyPI code base
 * simple scripts, driven by cronjobs
 * a simple distributed redirection setup to avoid having
   to change client side tools

On the client side:

 * no need to change the existing URL http://pypi.python.org/simple
   to access PyPI
 * redirects are already supported by setuptools via urllib2

Server side: upload cronjobs
----------------------------

Since the /simple index tree is currently being created dynamically,
we'd need to create static copies of it at regular intervals in order
to upload the content to the S3 bucket. This can easily be done using
tools such as wget or curl.
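
As a rough illustration of what such a snapshot step does, here is a
minimal Python sketch. It is not a replacement for wget or curl, just
a way to show the idea. The function names, the fetch parameter and
the directory layout (one index.html per package) are assumptions
made for this example, and the index URL is expected to end with a
slash.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def fetch_page(url):
    """Default fetcher; pass a stub as `fetch` to test without network."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", "replace")


def snapshot_simple(index_url, dest_dir, fetch=fetch_page):
    """Save the index page and every linked sub-page below dest_dir."""
    index_html = fetch(index_url)
    os.makedirs(dest_dir, exist_ok=True)
    with open(os.path.join(dest_dir, "index.html"), "w",
              encoding="utf-8") as f:
        f.write(index_html)
    saved = []
    for href in extract_links(index_html):
        # Use the last path component as the package directory name.
        name = href.strip("/").split("/")[-1]
        if name in ("", ".", ".."):
            continue
        page = fetch(urljoin(index_url, href))
        pkg_dir = os.path.join(dest_dir, name)
        os.makedirs(pkg_dir, exist_ok=True)
        with open(os.path.join(pkg_dir, "index.html"), "w",
                  encoding="utf-8") as f:
            f.write(page)
        saved.append(name)
    return saved
```

A cronjob could call snapshot_simple("http://pypi.python.org/simple/",
some_directory) and then hand the resulting directory to the upload
step.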

Both the static copy of the /simple tree and the static files uploaded
to /packages then need to be uploaded or updated in the S3 bucket by a
cronjob running every 10-20 minutes.
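
One possible way for such a cronjob to find out what needs uploading
is to keep a checksum manifest between runs. The sketch below is only
an assumption about how this could be done: sync_tree, the JSON
manifest and the upload callback are hypothetical names, and the
actual transfer to S3 (e.g. via boto) is deliberately left to the
callback.

```python
import hashlib
import json
import os


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sync_tree(root, manifest_path, upload):
    """Call upload(key, path) for each new or modified file under root.

    The manifest maps bucket keys to checksums from the previous run.
    It should live outside root so that it is not uploaded itself.
    """
    try:
        with open(manifest_path) as f:
            seen = json.load(f)
    except FileNotFoundError:
        seen = {}
    uploaded = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            # Bucket keys use forward slashes regardless of platform.
            key = os.path.relpath(path, root).replace(os.sep, "/")
            checksum = file_checksum(path)
            if seen.get(key) != checksum:
                upload(key, path)
                seen[key] = checksum
                uploaded.append(key)
    with open(manifest_path, "w") as f:
        json.dump(seen, f, indent=1, sort_keys=True)
    return uploaded
```

Running it a second time without any changes uploads nothing, so the
10-20 minute interval only costs a scan of the local files.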

Server side: downloads statistics
---------------------------------

The next step would then be to configure access logs:

 http://docs.amazonwebservices.com/AmazonCloudFront/latest/DeveloperGuide/index.html?AccessLogs.html

and add a cronjob to download them to the PyPI server.

Since the format is a bit different than the Apache log format used by
the PyPI software, we'd have two options:

 1. convert the Cloudfront format to Apache format and simply
    append the converted logs to the local log files

 2. write a Cloudfront log file reader and add it to the
    apache_count_dist.py script that updates the download
    counts on the web GUI

Both options require no more than a few hours to implement and test.
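
To give an idea of option 1, here is a minimal conversion sketch. It
assumes the Cloudfront log is tab-separated with a "#Fields:" header
naming the columns (date, time, c-ip, cs-method, cs-uri-stem,
sc-status, sc-bytes, cs(Referer), cs(User-Agent)) and that times are
in UTC; please check this against the access log documentation linked
above. The function name is made up, and the HTTP version in the
output is a placeholder since the sketch does not read it from the
log.

```python
from datetime import datetime
from urllib.parse import unquote

APACHE_LINE = ('{ip} - - [{ts} +0000] "{method} {path} HTTP/1.1" '
               '{status} {size} "{referer}" "{agent}"')


def cloudfront_to_apache(lines):
    """Yield Apache combined-format lines for Cloudfront log lines."""
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            # Column names come from the header, not from a fixed order.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#"):
            continue
        if not fields:
            raise ValueError("log data found before a #Fields: header")
        row = dict(zip(fields, line.split("\t")))
        stamp = datetime.strptime(row["date"] + " " + row["time"],
                                  "%Y-%m-%d %H:%M:%S")
        yield APACHE_LINE.format(
            ip=row["c-ip"],
            ts=stamp.strftime("%d/%b/%Y:%H:%M:%S"),
            method=row["cs-method"],
            path=row["cs-uri-stem"],
            status=row["sc-status"],
            size=row["sc-bytes"],
            # Assumption: these two fields arrive URL-encoded.
            referer=unquote(row["cs(Referer)"]),
            agent=unquote(row["cs(User-Agent)"]),
        )
```

The converted lines could then be appended to the local log files
before apache_count_dist.py runs.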

Server side: redirection setup
------------------------------

Since PyPI wasn't designed to be put on a CDN, it mixes static file
URL paths with dynamic access ones, e.g.

dynamic:

 http://pypi.python.org/pypi
 (and a few others)

static:

 http://pypi.python.org/simple
 http://pypi.python.org/packages

To move part of the URL path tree to a CDN, which works based on
domains, we will need to provide a URL redirection setup that
redirects client side tools to the new location.

As Martin von Loewis mentioned, this will require distributing the
redirection setup to more than just one server as well.

Fortunately, this is not difficult to do: it requires a preconfigured
lighttpd (*) setup running on N different servers which then all
provide the necessary redirections (and nothing more):

dynamic:

 http://pypi.python.org/ -> http://ximinez.python.org/pypi
 http://pypi.python.org/pypi -> http://ximinez.python.org/pypi
 (and possibly a few others)

static:

 http://pypi.python.org/simple -> http://pypi-static.python.org/simple
 http://pypi.python.org/packages -> http://pypi-static.python.org/packages
 http://pypi.python.org/documentation -> http://pypi-static.python.org/documentation
 (note: pypi-static.python.org is a CNAME alias for the Cloudfront
  domain issued to the S3 bucket where we upload the data)
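
The mapping itself is small. The following Python sketch only
illustrates the redirection rules listed above; it is not the
lighttpd configuration. The function name is hypothetical and paths
not covered by the rules simply return None.

```python
DYNAMIC_HOST = "http://ximinez.python.org"
STATIC_HOST = "http://pypi-static.python.org"
STATIC_PREFIXES = ("/simple", "/packages", "/documentation")


def redirect_target(path):
    """Return the redirect URL for a pypi.python.org path, or None."""
    # Static trees are served from the CDN under the same path.
    for prefix in STATIC_PREFIXES:
        if path == prefix or path.startswith(prefix + "/"):
            return STATIC_HOST + path
    # The site root is sent to the dynamic /pypi entry point.
    if path in ("", "/"):
        return DYNAMIC_HOST + "/pypi"
    if path == "/pypi" or path.startswith("/pypi/"):
        return DYNAMIC_HOST + path
    # "(and possibly a few others)" would be added here.
    return None
```

For example, redirect_target("/simple/foo/") gives
http://pypi-static.python.org/simple/foo/.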

The pypi.python.org domain would then have to be set up to map to
multiple IP addresses via DNS round-robin, one entry for each
redirection server, e.g.

 pypi.python.org. IN A 123.123.123.1
 pypi.python.org. IN A 123.123.123.2
 pypi.python.org. IN A 123.123.123.3
 pypi.python.org. IN A 123.123.123.4

Redirection servers could be run on all PSF server machines, and, to
increase availability, on PSF partner servers as well.

(*) lighttpd is a lightweight and fast HTTP server. It's easy to
set up, doesn't require a lot of resources on the server machine and
runs stably.

Long-term changes
-----------------

While enabling the above redirection setup, we should also start
working on changing PyPI and the client tools to use two new domains
which then cleanly separate the static CDN file access from the
dynamic PyPI server access:

 pypi.python.org
 pypi-static.python.org

Such a transition on the client side is expected to take at least a
few years. After that, the redirection service can be shut down or
used to distribute and scale the dynamic PyPI service parts.

Side-effects
------------

Restarts of the PyPI server, network outages, or hardware failures
would not affect the static copies of the PyPI on the CDN. setuptools,
easy_install, pip, zc.buildout, etc. would continue to work.

The S3 bucket would serve as additional backup for the files on PyPI.

Later integration with Amazon EC2 (their virtual server offering)
would easily be possible for more scalability and reduced system
administration load.

Costs
-----

Amazon charges for S3 and Cloudfront storage, transfer and access. The
costs vary depending on location.

 http://aws.amazon.com/cloudfront/#pricing
 http://aws.amazon.com/s3/#pricing

To get an idea of the costs, we'd have to take a closer look at
the PyPI web stats:

 http://pypi.python.org/webstats/usage_201005.html

In May 2010, PyPI transferred 819GB of data and had to handle 22
million requests.

Using the AWS monthly calculator this gives roughly (I used 37KB as
average object size and 35% US, 35% EU, 10% HK, 10% JP as basis): USD
132 per month, or about USD 1,600 per year.
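
As a quick sanity check of these figures, the following lines redo
the simple arithmetic. They do not reproduce the AWS calculator. The
variable names are made up, and treating 1GB as 10**9 bytes is an
assumption.

```python
# Figures quoted above for May 2010.
transferred_bytes = 819 * 10**9   # 819GB, assuming decimal gigabytes
requests = 22 * 10**6             # 22 million requests
monthly_usd = 132                 # estimate from the AWS calculator

# Average size of a delivered object, in KB (1KB = 1000 bytes).
avg_object_kb = transferred_bytes / requests / 1000

# Yearly cost if every month looked like May 2010.
yearly_usd = monthly_usd * 12

print("average object size: %.1f KB" % avg_object_kb)
print("yearly cost: USD %d" % yearly_usd)
```

The average comes out at about 37KB, which matches the object size
used above, and twelve months at USD 132 come to USD 1,584, i.e.
roughly USD 1,600 per year.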

Refinancing the costs
---------------------

Since PyPI is being used as an essential resource by many important
Python projects (Zope, Plone, Django, etc.), it's fair to ask the
respective foundations and the general Python community for donations
to help refinance the administration costs.

A prominent donation button should go on the PyPI page with a text
explaining how PyPI is being hosted and why donations are necessary.

We may also be able to directly ask for donations from the above
foundations. Details of this are currently being evaluated by the PSF
board (there are some issues related to our non-profit status that
make this more complicated than it appears at first).

Effort
------

Given that most of the tools are readily available, setting up the
servers shouldn't take more than 2-3 developer days for developers
who've worked with Amazon S3 and Cloudfront before, including testing.

It is expected that we'll find volunteers to implement the necessary
changes.

"""

-- 
Marc-Andre Lemburg
eGenix.com
