[Catalog-sig] Proposal: Move PyPI static data to the cloud for better availability (version 2)

Ian Bicking ianb at colorstudy.com
Tue Jun 29 18:54:03 CEST 2010

A few notes:

On Tue, Jun 29, 2010 at 9:39 AM, M.-A. Lemburg <mal at egenix.com> wrote:

> In order to maintain its credibility as a software repository, to
> support the many different projects relying on the PyPI infrastructure
> and the many users who rely on the simplified installation process
> enabled by PyPI, the PSF needs to take action and move the essential
> parts of PyPI to a more robust infrastructure that provides:
>  * scalability
>  * 24/7 outsourced system administration management

In a sense a CDN offers outsourced system administration -- if you upload
content, they are responsible for making sure it gets served up.  But that's
all they do.

Other "cloud" systems only provide system administration for infrastructure
issues, such as a network routing problem.  They do not manage anything on
your machine itself.  It is possible to get hosting with system administration
included (Rackspace Managed Servers are an example), but these are quite
expensive -- basically you are paying an overhead on hosting to have a
competent sysadmin on hand.

> -----
> PyPI provides four different mechanisms for accessing the stored
> information:
>  * a web GUI that is meant for use by humans
>  * an RPC interface which is mostly used for uploading new
>   content
>  * a semi-static /simple package listing, used by setuptools
>  * a static area /packages for package download files and
>   documentation, used by both the web GUI and setuptools

The static packages are uploaded through the RPC interface (setup.py upload)
and automatically linked in.  There is no privileged aspect to them: Setuptools
(easy_install/pip) just reads the links provided, and if they happen to
point to PyPI packages then that's what is fetched.  I mention this because
changing those URLs on the server side will be easy as a result.

> The /simple package listing is a dump of all packages in PyPI using a
> simple HTML page with links to sub-pages for each package. These
> sub-pages provide links to download files and external references.
> External tools like easy_install only use the /simple package
> listing together with the hosted package download files.
>
> While the /simple package listing is currently dynamically created
> from the database in real-time, this is not really needed for normal
> operation. A static copy created every 10-20 minutes would provide the
> same level of service in much the same way.
>
> Moving static data to a CDN
> ---------------------------
>
> Under the proposal the static information stored in PyPI
> (meta-information as well as package download files and documentation)
> is moved to a content delivery network (CDN).
>
> For this purpose, the /simple package listing is replaced with a
> static copy that is recreated every 10-20 minutes using a cronjob on
> the PyPI server.
>
> At the same intervals, another script will scan the package and
> documentation files under /packages for updates and upload any changes
> to the CDN for neartime availability.

I disagree with this part of the proposal, because I think a 10-20 minute
delay introduces the possibility of invisible errors (if regeneration
silently stops, the delay becomes infinite), and represents a real
degradation of service, as new versions of packages will not be installable
until after regeneration.  Also, I think the RPC code (what is invoked with
setup.py register/upload) can regenerate these static pages immediately.
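
As a rough sketch of what regenerating at upload time could look like (the
function names and the in-memory dictionaries here are hypothetical, not
PyPI's actual code or schema), the upload handler rebuilds only the page of
the project that was touched before it returns:

```python
from html import escape


def render_simple_page(project, files):
    """Return the HTML for one project's /simple page.

    `files` is a list of (filename, url) pairs. This layout is an
    assumption for illustration, not PyPI's real data model.
    """
    links = "\n".join(
        '<a href="%s">%s</a><br/>' % (escape(url, quote=True), escape(filename))
        for filename, url in sorted(files)
    )
    return (
        "<html><head><title>Links for %s</title></head><body>\n"
        "%s\n</body></html>\n" % (escape(project), links)
    )


def handle_upload(pages, files_by_project, project, filename, url):
    """Record a new file, then rebuild that project's page synchronously."""
    files_by_project.setdefault(project, []).append((filename, url))
    pages[project] = render_simple_page(project, files_by_project[project])
    return pages[project]
```

Because only one project's page is rebuilt per upload, the extra work added
to the RPC call should stay small.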

Uploading to a CDN may have to be asynchronous, but to keep the data robust
we should really be storing the package locally and adding a new field to
point to the mirrored location (i.e., the CDN URL).  When the cron job runs
that field can be updated.  If the CDN upload fails (which is not unlikely)
then PyPI can keep using the local location.  The cron job would then also
be triggering another regeneration of the static file in /static, but so
long as you are only regenerating on changes this isn't much overhead.
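
A minimal sketch of the local-copy-plus-mirror-field idea, assuming a
hypothetical PackageFile record and a hypothetical upload callable that
returns the CDN URL or raises OSError on failure (neither is a real PyPI or
CDN API):

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class PackageFile:
    filename: str
    local_url: str
    cdn_url: Optional[str] = None  # the new field; None until mirrored

    def download_url(self) -> str:
        # Fall back to the local copy whenever no mirror is recorded.
        return self.cdn_url or self.local_url


def sync_to_cdn(
    files: Iterable[PackageFile],
    upload: Callable[[PackageFile], str],
) -> List[PackageFile]:
    """Cron-job body: mirror files that are not on the CDN yet.

    Returns the files whose URL changed, so that only their pages
    need to be regenerated.
    """
    changed = []
    for f in files:
        if f.cdn_url is not None:
            continue  # already mirrored
        try:
            f.cdn_url = upload(f)
        except OSError:
            continue  # keep serving the local copy; retry on the next run
        changed.append(f)
    return changed
```

A failed CDN upload leaves cdn_url unset, so the generated links keep
pointing at the local file and nothing becomes uninstallable.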

Also, making upload/register a synchronous operation will slow down RPC
commands, but I don't think this is a problem -- I would much rather have an
upload be slow to finish than fast but not know when the result will be
available.  I don't know what kind of latency to expect.

Also, I'd like to offer a counterproposal that does not use a CDN:

* Have PyPI write out static files *locally*
* Use rewrite rules so those files get served without touching PyPI.
* Move the PyPI installation to mod_wsgi (I believe it is using FCGI now?),
with conservative settings for things like MaxRequests.  I believe this will
significantly improve the problem of PyPI taking down Apache, which means
the static files will still be available even if PyPI itself is down.
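
The first step above (writing static files locally) could be sketched like
this.  The directory layout and the function name are assumptions for
illustration; the point is that the file is replaced atomically and only
when its content changed, so the web server never sees a half-written page:

```python
import os
import tempfile


def write_static_page(root, project, html):
    """Write <root>/<project>/index.html atomically.

    Returns True if the file was created or changed, False if the
    existing content was already identical.
    """
    directory = os.path.join(root, project)
    os.makedirs(directory, exist_ok=True)
    target = os.path.join(directory, "index.html")
    data = html.encode("utf-8")

    try:
        with open(target, "rb") as existing:
            if existing.read() == data:
                return False  # nothing to do
    except FileNotFoundError:
        pass

    # Write to a temporary file in the same directory, then rename it
    # over the target so readers see either the old or the new page.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(data)
        os.chmod(tmp, 0o644)  # mkstemp creates the file as 0600
        os.replace(tmp, target)
    except BaseException:
        os.unlink(tmp)
        raise
    return True
```

The rewrite rules and the mod_wsgi settings live in the Apache
configuration and are not shown here.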

This is largely work that would have to happen anyway to move to a CDN, but
it's simpler (given how PyPI works now) and I believe it will relieve most of
the problems we've seen.  PyPI right now is really quite reliable; these small
changes would, I think, be low-risk and less likely to introduce new problems
while addressing what I suspect is the source of the problems.

Ian Bicking  |  http://blog.ianbicking.org