Ian Bicking wrote:
Rick Warner wrote:
Actually, PyPI is replicated. See, for example, http://download.zope.org/simple/.
It may be that some of the mirrors should be better advertised.
A half-hearted effort, at best, after the problems last year. When I configure a CPAN client (once per user), I pick the replicas I want searched for any query from a list of hundreds of replicas distributed around the world.
Can someone suggest the best way to search among repositories? For instance, try to connect to one, then stop if it gives Connection Refused? If it gives any unexpected error (5xx)? Timing out is a common failure, and a pain in the butt, but I guess there's that too. What does the CPAN client do?
I don't know what CPAN does but Linux distributions have also solved this problem. We send out massive numbers of updates and new packages to users every day so we need a mirror network that works well.
In Fedora we have a server that gives out a list of mirrors, using GeoIP data to try to assemble a list of mirrors near you: first by country, then by continent, and then globally. There are special cases at the continent level; for instance, certain middle eastern countries connect better to Europe than to Asia.
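To make the tiering concrete, here is a rough sketch of how such a list could be ordered. This is not MirrorManager's actual code. The function name, the mirror dict layout, and the `continent_overrides` parameter are all made up for illustration, and the GeoIP lookup that produces the client's country and continent is assumed to have happened already.

```python
import random


def order_mirrors(mirrors, client_country, client_continent,
                  continent_overrides=None, rng=random):
    """Return mirror URLs ordered same-country, same-continent, then global.

    mirrors: list of dicts with 'url', 'country' and 'continent' keys
             (a hypothetical layout, not a real MirrorManager schema).
    continent_overrides: optional dict mapping a client country code to the
             continent it should be treated as belonging to (the "special
             cases" in the text above).
    """
    overrides = continent_overrides or {}
    continent = overrides.get(client_country, client_continent)

    same_country, same_continent, rest = [], [], []
    for mirror in mirrors:
        if mirror["country"] == client_country:
            same_country.append(mirror)
        elif mirror["continent"] == continent:
            same_continent.append(mirror)
        else:
            rest.append(mirror)

    # Shuffle inside each tier so load is spread over equally close mirrors.
    for tier in (same_country, same_continent, rest):
        rng.shuffle(tier)

    return [m["url"] for m in same_country + same_continent + rest]
```

Passing `rng` in makes the shuffle easy to seed in tests; a real server would presumably also weight mirrors by bandwidth or freshness, which this sketch ignores.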
This server gives the mirror list out (randomized among the close mirrors) and the client goes through the list, trying to retrieve package metadata. If it times out or otherwise fails, then it goes on to the next mirror until it gets data. (Note, some alternate clients are able to download from multiple servers at the same time if multiple packages are needed.)
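For the question of how a client might walk the list, a minimal sketch follows. It is not what yum or any CPAN client actually does; `fetch_from_mirrors` and its parameters are names I made up, and it simply treats a timeout, a refused connection, or an HTTP error status as "try the next mirror".

```python
import urllib.request


def fetch_from_mirrors(mirrors, path, timeout=10,
                       opener=urllib.request.urlopen):
    """Try each mirror in order and return (mirror, data) from the first success.

    mirrors: ordered list of base URLs, nearest first.
    path:    path of the file to fetch, relative to each base URL.
    opener:  callable with urlopen's (url, timeout=...) shape; it is a
             parameter only so the network can be faked in tests.
    """
    errors = []
    for base in mirrors:
        url = base.rstrip("/") + "/" + path.lstrip("/")
        try:
            with opener(url, timeout=timeout) as response:
                return base, response.read()
        except OSError as exc:
            # URLError, HTTPError (4xx/5xx), timeouts and refused
            # connections are all OSError subclasses: note it and move on.
            errors.append((base, exc))
    raise RuntimeError("all mirrors failed: %r" % (errors,))
```

Usage would look like `fetch_from_mirrors(mirror_list, "repodata/repomd.xml")`. Keeping the timeout short matters more than anything else here, since a hung mirror is the failure users notice most.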
The mirrorlist server is a pretty neat application (https://fedorahosted.org/mirrormanager). It has a TurboGears front end that allows people to add a new mirror (https://admin.fedoraproject.org/mirrormanager), either for public availability or restricted to a subset of IPs. It allows you to mirror only a subset of the whole content. And it has several methods of telling whether a mirror is in sync or outdated. The latter is important to us for making sure we're giving our users the latest updates that we've shipped. The methods range from a script that the mirror admin can run from a cron job to check the available data and report back, to a process run on our servers that checks that the mirrors have up-to-date content. The mirrorlist itself is cached and served from a mod_python script (soon to be mod_wsgi) for speed.
You might also be interested in the way that we work with package metadata. In Fedora and many other rpm-based distributions (some Debian-based distros talked about this as well, but I don't know if it was ever implemented there) we create static xml files (and recently, sqlite dbs as well) that live on the mirrors. The client hits the mirror and downloads at least two of these files. The repomd.xml file describes the other files with checksums; it is used to verify that the other metadata is up to date and to tell whether anything has changed. The primary.xml file stores the information that is generally needed for doing depsolving on the packages. Then we have several other xml files that collectively contain the complete metadata for the packages but are usually overkill... by separating this stuff out, we save clients from having to download it in the common case. This stuff could provide some design ideas for constructing a PyPI metadata repository and is documented here: http://createrepo.baseurl.org/
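The index-plus-checksums idea can be sketched in a few lines. Note that the XML layout below is a simplified stand-in I made up for illustration, not the real repomd.xml schema (which has namespaces and more fields), and `parse_index` and `is_current` are hypothetical names.

```python
import hashlib
import xml.etree.ElementTree as ET


def parse_index(index_xml):
    """Parse a simplified repomd-style index into {type: entry}.

    Assumed layout (illustrative only):
      <repomd>
        <data type="primary">
          <location href="repodata/primary.xml"/>
          <checksum type="sha256">HEXDIGEST</checksum>
        </data>
      </repomd>
    """
    entries = {}
    for data in ET.fromstring(index_xml).findall("data"):
        checksum = data.find("checksum")
        entries[data.get("type")] = {
            "href": data.find("location").get("href"),
            "algo": checksum.get("type"),
            "digest": checksum.text.strip(),
        }
    return entries


def is_current(entry, payload):
    """Return True if payload (bytes) matches the checksum in the index."""
    actual = hashlib.new(entry["algo"], payload).hexdigest()
    return actual == entry["digest"]
```

A client would fetch the small index first, compare the digests against its cached copies with `is_current`, and download only the files that changed. That is what lets the common case stay cheap.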
Note: the reason we went with static metadata rather than some sort of cgi script is that static data can be mirrored without the mirror being required to run anything beyond a simple rsync cron job. This makes finding mirrors much easier.