[Catalog-sig] Prototype setuptools-specific PyPI index.

Jim Fulton jim at zope.com
Thu Jul 19 13:06:34 CEST 2007


Over the past few months, we've struggled quite a bit with Python  
Package Index (PyPI) performance and stability.  Thanks to the heroic  
efforts of Martin v. Löwis and others, performance and especially  
stability have improved quite a bit. Martin has demonstrated that, at  
least when running well, PyPI seems to answer most requests on the  
order of 7 milliseconds (around 150 requests per second) internally.
That's not bad.  Unfortunately for users, actual times can be quite a
bit longer.  For me at work, requests take around 300 milliseconds.
For Martin, they seem to take somewhat longer.  300 milliseconds  
isn't so bad for a request or two.  However, easy_install can easily
make tens or even hundreds of requests to satisfy a user request for a
package.  zc.buildout, when verifying that a large system with many
tens of packages has the most up-to-date versions of each package, can
easily make thousands of requests.

Why do setuptools and buildout make so many requests?  If a package  
exposes more than one release, then setuptools checks the package's  
main PyPI page and the pages for each release.  We need to be able to  
easily use older releases, so we can't hide old releases.  Typical  
projects of ours have many old releases exposed.  Even if setuptools
were more clever in the way it searched PyPI, it would still have to
make a minimum of 2 requests per package for packages with multiple
versions exposed.
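
To put rough numbers on this, here is a back-of-the-envelope sketch in
Python.  The function names and the simple cost model (one request for
the main page, plus one per exposed release when there is more than
one) are only an illustration of the behavior described above, not
setuptools code.  It ignores off-site home pages and the time spent
downloading distributions, and the 40-package buildout is made up.

```python
def index_requests(exposed_releases):
    """Approximate number of PyPI page reads for one package.

    Assumption: a package with a single exposed release needs one read;
    otherwise the main page plus one page per exposed release.
    """
    if exposed_releases <= 1:
        return 1
    return 1 + exposed_releases


def estimate(packages, seconds_per_request=0.3):
    """Return (total_requests, estimated_seconds) for a set of packages.

    packages maps a package name to its number of exposed releases.
    """
    total = sum(index_requests(n) for n in packages.values())
    return total, total * seconds_per_request


# Hypothetical buildout: 40 packages with 12 exposed releases each.
packages = dict(("pkg%d" % i, 12) for i in range(40))
total, seconds = estimate(packages)
print("%d requests, roughly %d seconds" % (total, round(seconds)))
```

Even with only 300 milliseconds per request, the per-release reads
dominate as soon as projects keep many old releases exposed.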

Another potential issue is that PyPI pages can be large.  I've found  
it convenient to use PyPI package pages as the home page for many of  
my projects.  I like to include package documentation in my project  
pages.  Perhaps this is an abuse of PyPI, but it is very convenient  
for me and no one has complained. :)  The zc.buildout pages are  
around 200K.  That's a fair bit of data for setuptools to download  
and scan for download URLs.

In the course of this discussion, I've realized that it doesn't make  
sense for setuptools to use the same interface that humans use.   
setuptools doesn't need to see all of the data that is useful to  
humans. Similarly, humans generally don't need to see all of the  
historical releases for a project.  I suggested a simple page format  
designed just for setuptools.  An alternative would be an xmlrpc  
API.  I prefer pages because I think that, over time, the number of
requests from automated tools like easy_install and zc.buildout will
increase substantially and, ultimately, will overwhelm dynamic
servers, even ones like PyPI that are reasonably fast.  I also think
that a simple static collection of pages will be easier to mirror and  
I think some number of geographic mirrors is likely to help some  
people.  I promised to prototype the format I suggested.

I've created an experimental prototype setuptools-specific package
index at

   http://download.zope.org/ppix

Going to that page gives brief instructions for using it with  
easy_install and zc.buildout.  To see an individual package page, add  
the package name to the URL, as in:

   http://download.zope.org/ppix/setuptools/

A few things to note about this:

- I don't expose a long package list at
http://download.zope.org/ppix/.  The long package list would be
expensive to download and supports a use case that I consider to be
of negative value, which is installing packages with case-insensitive
package names.  I think it is important for humans to be able to
search for packages using case-insensitive search terms, but I think
that, after identifying a package, precise package names should be
used.  I think it is especially important that precise package names
be used in package requirements.

- There is a single page per package.  This can greatly reduce the  
number of requests.  Packages that store all of their distributions  
in PyPI and that don't have off-site home pages or download URLs can  
be scanned with a single request.  Note that I excluded home page and
download URLs that pointed back to the package's PyPI page, as that
wouldn't provide any new information to setuptools.

- Download URLs for *hidden* packages are included.  Humans don't  
need to see old revisions, but setuptools-based tools do.  If we used  
an index like this for setuptools, we could stop unhiding old  
releases when we created new releases in PyPI.  This would make PyPI  
more useful to humans and less of a pain for developers.

- Download URLs are the same as they are in PyPI.  Using this new  
index, distributions are still downloaded from PyPI, so the index  
doesn't affect PyPI download statistics.
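
As a rough illustration of what a tool has to do with one of these
per-package pages, the sketch below pulls every link out of a page and
splits off the #md5= fragment.  The sample HTML is made up for the
example (it assumes the page is essentially a list of anchors), and
LinkCollector and split_md5 are hypothetical names; this is not
setuptools' actual scanning code.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def split_md5(url):
    """Return (url_without_fragment, md5_or_None)."""
    base, fragment = urldefrag(url)
    if fragment.startswith("md5="):
        return base, fragment[len("md5="):]
    return base, None


# Made-up page content; the real page layout may differ.
SAMPLE_PAGE = """
<html><body>
<a href="http://cheeseshop.python.org/packages/2.5/z/zc.buildout/zc.buildout-1.0.0b28-py2.5.egg#md5=4e37e53f010ed7984555a029732f479d">zc.buildout-1.0.0b28-py2.5.egg</a>
<a href="http://svn.zope.org/zc.buildout">home page</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_PAGE)
downloads = [split_md5(link) for link in collector.links]
for url, md5 in downloads:
    print(url, md5)
```

With everything on one small page, a single request is enough to learn
every candidate download URL for a package.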

To see the impact of this, it's interesting to look at installing  
zc.buildout using easy_install from PyPI and from the experimental  
index.  Installing using PyPI looks like this:

   (env)jim at ds9:~/tmp$ time easy_install zc.buildout
   Searching for zc.buildout
   Reading http://cheeseshop.python.org/pypi/zc.buildout/
   Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b19
   Reading http://svn.zope.org/zc.buildout
   Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b22
   Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b23
   Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b20
   Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b21
   Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b26
   Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b27
   Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b24
   Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b25
   Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b28
   Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b17
   Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b16
   Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b18
   Best match: zc.buildout 1.0.0b28
   Downloading http://cheeseshop.python.org/packages/2.5/z/zc.buildout/zc.buildout-1.0.0b28-py2.5.egg#md5=4e37e53f010ed7984555a029732f479d
   Processing zc.buildout-1.0.0b28-py2.5.egg
   creating /home/jim/tmp/env/lib/python2.5/zc.buildout-1.0.0b28-py2.5.egg
   Extracting zc.buildout-1.0.0b28-py2.5.egg to /home/jim/tmp/env/lib/python2.5
   Adding zc.buildout 1.0.0b28 to easy-install.pth file
   Installing buildout script to /home/jim/tmp/env/bin/

   Installed /home/jim/tmp/env/lib/python2.5/zc.buildout-1.0.0b28-py2.5.egg
   Processing dependencies for zc.buildout
   Searching for setuptools==0.6c6
   Best match: setuptools 0.6c6
   Processing setuptools-0.6c6-py2.5.egg
   Adding setuptools 0.6c6 to easy-install.pth file
   Installing easy_install script to /home/jim/tmp/env/bin/
   Installing easy_install-2.5 script to /home/jim/tmp/env/bin/

   Installed /home/jim/tmp/env/lib/python2.5/setuptools-0.6c6-py2.5.egg
   Processing dependencies for setuptools==0.6c6
   Finished processing dependencies for setuptools==0.6c6
   Finished installing setuptools==0.6c6
   Finished processing dependencies for zc.buildout
   Finished installing zc.buildout

   real	0m31.360s
   user	0m1.136s
   sys	0m0.060s

Note the large number of pages read.  Here I was installing a single  
package with one dependency, setuptools, that was already installed.  
Let's look at this again using the experimental index:

   (env)jim at ds9:~/tmp$ time easy_install -i http://download.zope.org/ppix zc.buildout
   Searching for zc.buildout
   Reading http://download.zope.org/ppix/zc.buildout/
   Best match: zc.buildout 1.0.0b28
   Downloading http://cheeseshop.python.org/packages/2.5/z/zc.buildout/zc.buildout-1.0.0b28-py2.5.egg#md5=4e37e53f010ed7984555a029732f479d
   Processing zc.buildout-1.0.0b28-py2.5.egg
   creating /home/jim/tmp/env/lib/python2.5/zc.buildout-1.0.0b28-py2.5.egg
   Extracting zc.buildout-1.0.0b28-py2.5.egg to /home/jim/tmp/env/lib/python2.5
   Adding zc.buildout 1.0.0b28 to easy-install.pth file
   Installing buildout script to /home/jim/tmp/env/bin/

   Installed /home/jim/tmp/env/lib/python2.5/zc.buildout-1.0.0b28-py2.5.egg
   Processing dependencies for zc.buildout
   Searching for setuptools==0.6c6
   Best match: setuptools 0.6c6
   Processing setuptools-0.6c6-py2.5.egg
   Adding setuptools 0.6c6 to easy-install.pth file
   Installing easy_install script to /home/jim/tmp/env/bin/
   Installing easy_install-2.5 script to /home/jim/tmp/env/bin/

   Installed /home/jim/tmp/env/lib/python2.5/setuptools-0.6c6-py2.5.egg
   Processing dependencies for setuptools==0.6c6
   Finished processing dependencies for setuptools==0.6c6
   Finished installing setuptools==0.6c6
   Finished processing dependencies for zc.buildout
   Finished installing zc.buildout

   real	0m7.006s
   user	0m0.244s
   sys	0m0.040s

Note:

- We made far fewer requests with the new index: 1 page read instead
of 15.

- Most of the time in the second example was spent actually  
downloading the buildout distribution.  Most of the time in the first  
example was spent reading the index.

- I used workingenv to create clean environments for each of the  
examples above.
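
One simple way to compare runs like these is to count the "Reading"
lines in the easy_install output, since each one corresponds to an
index page or home page fetch.  A small sketch (the helper name
count_page_reads is hypothetical, and the sample log is abbreviated
from the second transcript above):

```python
def count_page_reads(log_text):
    """Count the pages easy_install reported reading."""
    return sum(
        1 for line in log_text.splitlines()
        if line.strip().startswith("Reading ")
    )


# Abbreviated from the experimental-index transcript.
SAMPLE_LOG = """\
   Searching for zc.buildout
   Reading http://download.zope.org/ppix/zc.buildout/
   Best match: zc.buildout 1.0.0b28
"""

reads = count_page_reads(SAMPLE_LOG)
print(reads)
```

Feeding it each full transcript gives a quick like-for-like comparison
of how chatty the two indexes are.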

WRT zc.buildout, refreshing a buildout with just ZODB installed in it  
takes about 45 seconds for me using PyPI and about 5 seconds using  
the experimental index.

Some of the speed improvement is due to the fact that the
experimental index is much closer to me (on the net) than PyPI.  ATM,
requests to PyPI take *me* around 500 milliseconds, while requests to
the experimental index are taking between 100 and 300 milliseconds.
(I'm at home and this seems to be somewhat variable.)  Most of the
speed improvement comes from reducing the number of requests.

I'm polling PyPI once a minute to get and apply updates. Thanks to  
the new XML-RPC method that Martin added, this is very efficient to do.
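
For anyone curious what one polling step might look like, here is a
minimal sketch.  It assumes the XML-RPC method in question is
changelog(since), taking a UTC timestamp and returning
(name, version, timestamp, action) tuples; that is an assumption, as
are the helper name and the fake client (with invented entries) used
so the example runs without a network.  It is an illustration only,
not the software behind the experimental index.

```python
import xmlrpc.client


def changed_packages(client, since):
    """Return the sorted names of packages changed since a UTC timestamp.

    Assumes client.changelog(since) returns
    (name, version, timestamp, action) tuples.
    """
    return sorted(set(entry[0] for entry in client.changelog(since)))


class FakeClient(object):
    """Stand-in for an XML-RPC proxy so the sketch runs offline."""

    def changelog(self, since):
        # Invented entries, for demonstration only.
        return [
            ("zc.buildout", "1.0.0b28", since + 10, "new release"),
            ("setuptools", "0.6c6", since + 20, "update"),
            ("zc.buildout", "1.0.0b28", since + 30, "add file"),
        ]


names = changed_packages(FakeClient(), since=1184842000)
print(names)

# Against a live server this would be something like:
#   client = xmlrpc.client.ServerProxy("http://cheeseshop.python.org/pypi")
#   names = changed_packages(client, last_poll_time)
```

Only the pages for the returned names need to be regenerated, which is
what keeps a once-a-minute poll cheap.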

I encourage people to check this out and even try using it with  
easy_install and especially buildout. AFAIK, aside from being much
faster and showing download files for hidden releases, it is
completely equivalent to PyPI for setuptools use.  My intention is to
keep this experimental index going and up to date for the foreseeable
future, and I plan to use it for all my work.

My primary goal is to prototype the new index format.  If this seems  
useful, then I think that www.python.org should expose an index in  
this format to setuptools, either at a different URL or by satisfying  
setuptools requests from the index based on client information.  I'd  
love to see this index populated via a baking mechanism that updates  
package pages when they change, rather than through polling as I'm  
doing.
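
A baking step along these lines might be as simple as rewriting one
static file whenever a package changes.  The sketch below assumes a
directory-per-package layout with an index.html inside and a bare list
of links; bake_package_page and the layout are hypothetical, not a
description of how PyPI or the prototype works.

```python
import os
import tempfile
from html import escape


def bake_package_page(root, name, links):
    """Write root/<name>/index.html listing the given (url, label) links.

    Returns the path of the file that was written.
    """
    package_dir = os.path.join(root, name)
    os.makedirs(package_dir, exist_ok=True)
    rows = "\n".join(
        '<a href="%s">%s</a><br/>' % (escape(url, quote=True), escape(label))
        for url, label in links
    )
    path = os.path.join(package_dir, "index.html")
    with open(path, "w") as f:
        f.write("<html><body>\n%s\n</body></html>\n" % rows)
    return path


# Demonstration in a throwaway directory.
root = tempfile.mkdtemp()
page = bake_package_page(
    root,
    "zc.buildout",
    [("http://svn.zope.org/zc.buildout", "home page")],
)
print(page)
```

Because the result is plain files, any static web server (or mirror)
can serve it without running PyPI's code.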

There would be some benefit to having geographic mirrors.  I suspect  
that having such mirrors available would improve performance further,  
at least for some folks.  It might also be useful to have some  
mirrors for redundancy purposes.  Note though that what I'm doing is
mirroring only the index data. I'm not mirroring distributions.  Of
course, I'd be happy to make my software available. (It already is  
via our subversion repository.)

I hope this effort spurs useful discussion and progress.

Jim

--
Jim Fulton			mailto:jim at zope.com		Python Powered!
CTO 				(540) 361-1714			http://www.python.org
Zope Corporation	http://www.zope.com		http://www.zope.org




