[Distutils] Python people want CPAN and how the latter came about

Sridhar Ratnakumar sridharr at activestate.com
Fri Dec 25 06:09:04 CET 2009


On 12/24/2009 3:00 AM, "Martin v. Löwis" wrote:
>> Some reasons to have PyPI host packages have already been mentioned in
>> this thread: it makes mirroring easier, and it makes it easier for
>> individuals to build new services (web sites primarily) that present new
>> interfaces to the Python package collection.  Mirroring for its own sake
>> is some use, but being able to grab the entire Python package repository
>> easily from a single source is valuable for the second goal, that of
>> furnishing the foundation ("shoulders of giants" and all that) for those
>> with vision (and round tuits) to take the next step.
> That is fairly easily possible today, even without everybody uploading
> all files. It isn't easy *per se*, but needs a lot of code. However,
> this code has already been written, and using it is fairly easy.
>
>> If I wanted to host a site that (e.g.) indexed Python modules from PyPI
>> by module (not package) name, and extracted and provided the
>> documentation in HTML format, from what I've been reading I'd have to
>> build a scraper or XMLRPC tool to walk PyPI, and then for each package,
>> download it from another site (that may not have the uptime or
>> scalability of PyPI), a nontrivial burden on aspiring visionaries that
>> just want to build an addition and then go have a beer and discuss
>> further improvements.
> Not at all. You would just pick one of the ten or so packages that
> already do precisely that, and use it.
>
>> (As a point of practical interest, what _would_ be the most efficient
>> way to download the entire set of Python modules listed on PyPI? A
>> search comes up with z3c.pypimirror,
>> http://pypi.python.org/pypi/z3c.pypimirror; is this the standard tool?)
> There are a number of other mirroring tools, such as EggBasket and
> collective.eggproxy. For mirroring the whole index, pypimirror is
> probably the best starting point.

For starters, this is how z3c.pypimirror works:

1) Initial fetch: traverse http://pypi.python.org/simple/ and, for each 
package's index page (e.g. http://pypi.python.org/simple/AOPython), 
follow every link in it: a) if the link is an actual sdist tarball, 
download it; b) if it is an external link (the project homepage), scrape 
that page for download links (.tar.gz, .zip) and download those too.

2) [run this every day] Update for the last 24 hours: call the XML-RPC 
`changelog` method, which returns the packages released or changed in 
the last 24 hours, and redo the above operation for just those packages. 
(A sketch of both steps follows below.)

It is unreliable [bugs.launchpad.net/pypi-mirror/+bug/386143] and 
provides no pre-extracted metadata. I wouldn't call it a mirroring tool, 
for what it produces is not an exact copy of the PyPI data[1].

I doubt that a proliferation of mirror sites and third-party tools can 
happen with anything but simple rsync/FTP-based archives.

-srid

***
[1] "In computing, a mirror is an exact copy of a data set. On the 
Internet, a mirror site is an exact copy of another Internet site. ..."

