Re: [Distutils] "Python Package Management Sucks"

1 Oct 2008

Ian Bicking wrote:
...
Rick Warner wrote:
...
...
Actually, PyPI is replicated.  See, for example,
http://download.zope.org/simple/.
It may be that some of the mirrors should be better advertised.
A half-hearted effort, at best, after the problems last year.  When I
configure a CPAN client (once per user) I create a list of replicas I
want to search for any query from a list of hundreds of replicas
distributed around the world.
Can someone suggest the best way to search among repositories?  For
instance, try to connect to one, then stop if it gives Connection
Refused?  If it gives any unexpected error (5xx)?  Timing out is a
common failure, and a pain in the butt, but I guess there's that too.
What does the CPAN client do?
I don't know what CPAN does but Linux distributions have also solved
this problem.  We send out massive numbers of updates and new packages
to users every day so we need a mirror network that works well.

In Fedora we have a server that gives out a list of mirrors, using GeoIP
data to assemble a list of mirrors near you: first by country, then by
continent (with special cases, for instance, for certain Middle Eastern
countries that connect better to Europe than to Asia), and then global.
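
As a rough illustration of the tiered ordering described above, here is a
minimal Python sketch.  It is not MirrorManager's code: the Mirror record,
its field names, and order_mirrors() are made up for this example, the
GeoIP lookup is assumed to have already produced the client's country and
continent codes, and the regional special cases are left out.  Each tier is
shuffled so that nearby mirrors share the load.

```python
import random
from collections import namedtuple

# Hypothetical mirror record; the field names are assumptions for this sketch.
Mirror = namedtuple("Mirror", "url country continent")


def order_mirrors(mirrors, country, continent, rng=random):
    """Order mirrors: same country, then same continent, then everything else.

    Mirrors are shuffled within each tier so close mirrors are tried in a
    random order.
    """
    same_country = [m for m in mirrors if m.country == country]
    same_continent = [m for m in mirrors
                      if m.country != country and m.continent == continent]
    rest = [m for m in mirrors
            if m.country != country and m.continent != continent]

    ordered = []
    for tier in (same_country, same_continent, rest):
        rng.shuffle(tier)  # randomize only within the tier
        ordered.extend(tier)
    return ordered
```

Passing a seeded random.Random instance as rng makes the order repeatable,
which is handy for testing.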

This server gives the mirror list out (randomized among the close
mirrors) and the client goes through the list, trying to retrieve
package metadata.  If it times out or otherwise fails, then it goes on
to the next mirror until it gets data.  (Note, some alternate clients
are able to download from multiple servers at the same time if multiple
packages are needed.)
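
The client-side failover loop could look something like the following
Python sketch.  This is an assumption about one reasonable way to do it,
not what yum or any CPAN client actually does; http_fetch() and
fetch_from_mirrors() are hypothetical names, and the 10-second timeout is
an arbitrary choice.

```python
import urllib.request


def http_fetch(url, timeout=10):
    """Fetch url and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_from_mirrors(mirrors, path, fetch=http_fetch):
    """Try each mirror in order; return (mirror, data) from the first success.

    Catching OSError covers refused connections, timeouts, and HTTP error
    statuses (urllib's URLError and HTTPError are OSError subclasses).
    """
    errors = []
    for base in mirrors:
        url = base.rstrip("/") + "/" + path.lstrip("/")
        try:
            return base, fetch(url)
        except OSError as exc:
            errors.append((base, exc))  # remember why, then try the next one
    raise RuntimeError("all mirrors failed: %r" % (errors,))
```

The fetch parameter exists so the loop can be exercised without a network,
for example by passing a function that raises TimeoutError for some URLs.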

The mirrorlist server is a pretty neat application
(https://fedorahosted.org/mirrormanager).  It has a TurboGears front end
that allows people to add a new mirror
(https://admin.fedoraproject.org/mirrormanager) for public availability
or restricted to a subset of IPs.  It allows you to only mirror a subset
of the whole content.  And it has several methods of telling if the
mirror is in sync or outdated.  This is important to us for making
sure we're giving our users the latest updates that we've shipped.  The
methods range from a script that the mirror admin can run from their cron
job, which checks the data available and reports back, to a process run
on our servers that checks that the mirrors have up-to-date content.  The
mirrorlist itself is cached and served from a mod_python script (soon to
be mod_wsgi) for speed.

You might also be interested in the way that we work with package
metadata.  In Fedora and many other rpm-based distributions (some
Debian-based distros talked about this as well, but I don't know if it
was ever implemented there) we create static xml files (and recently,
sqlite dbs as well) that live on the mirrors.  The client hits the
mirror and downloads at least two of these files.  The repomd.xml file
describes the other files with checksums and is used to verify that the
other metadata is up to date and whether anything has changed.  The
primary.xml file stores information that is generally what is needed for
doing depsolving on the packages.  Then we have several other xml files
that collectively contain the complete metadata for the packages, which is
usually overkill.  By separating this stuff out, we save clients from
having to download it in the common case.  This stuff could provide some
design ideas for constructing a pypi metadata repository and is
documented here:  http://createrepo.baseurl.org/
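
To make the "small index file with checksums" idea concrete, here is a
Python sketch of how a client might decide which metadata files to
re-download.  The XML layout below is deliberately simplified and invented
for this example (it is not the real repomd.xml schema), SHA-256 is an
assumed checksum type, and stale_files() is a hypothetical helper.

```python
import hashlib
import xml.etree.ElementTree as ET

# Assumed, simplified index layout (NOT the real repomd.xml schema):
#   <index>
#     <data href="primary.xml"><checksum>HEX_SHA256</checksum></data>
#     ...
#   </index>


def stale_files(index_xml, cached):
    """Return the names of metadata files that must be (re)downloaded.

    index_xml: text of the index document.
    cached:    dict mapping file name -> bytes already held locally.
    """
    stale = []
    for data in ET.fromstring(index_xml).findall("data"):
        name = data.get("href")
        expected = (data.findtext("checksum") or "").strip()
        have = cached.get(name)
        # Missing locally, or the local copy no longer matches the index.
        if have is None or hashlib.sha256(have).hexdigest() != expected:
            stale.append(name)
    return stale
```

With this shape, a client that already holds current copies only downloads
the index on each run, and everything involved is a static file.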

Note: the reason we went with static metadata rather than some sort of
cgi script is that static data can be mirrored without the mirror being
required to run anything beyond a simple rsync cron job.  This makes
finding mirrors much easier.

-Toshio

Toshio Kuratomi