[Catalog-sig] Prototype setuptools-specific PyPI index.
Jim Fulton
jim at zope.com
Thu Jul 19 13:06:34 CEST 2007
Over the past few months, we've struggled quite a bit with Python
Package Index (PyPI) performance and stability. Thanks to the heroic
efforts of Martin v. Löwis and others, performance and especially
stability have improved quite a bit. Martin has demonstrated that, at
least when running well, PyPI seems to answer most requests on the
order of 7 milliseconds (around 150 requests per second) internally.
That's not bad. Unfortunately for users, actual times can be quite a
bit longer. For me at work, requests take around 300 milliseconds.
For Martin, they seem to take somewhat longer. 300 milliseconds
isn't so bad for a request or two; however, easy_install can easily
make tens or even hundreds of requests to satisfy a user request for a
package. zc.buildout, when verifying that a large system with many
tens of packages has the most up-to-date versions of each package, can
easily make thousands of requests.
Why do setuptools and buildout make so many requests? If a package
exposes more than one release, then setuptools checks the package's
main PyPI page and the pages for each release. We need to be able to
easily use older releases, so we can't hide old releases. Typical
projects of ours have many old releases exposed. setuptools could be
more clever in the way it searches PyPI, but it would still have to
make a minimum of 2 requests per package for packages with multiple
versions exposed.
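As a rough back-of-the-envelope model of the cost described above, here
is a small Python sketch. It assumes one request for the main package
page plus one per exposed release, and a fixed round-trip latency. The
function names and the workload (30 packages with 10 exposed releases
each) are made up for illustration; only the 300 millisecond figure
comes from the text above.

```python
def index_requests(releases_exposed):
    """Requests needed to scan one package on PyPI.

    Assumption: a package with a single exposed release needs only its
    main page; otherwise the main page plus one page per release is read.
    """
    if releases_exposed <= 1:
        return 1
    return 1 + releases_exposed


def scan_seconds(release_counts, latency_ms):
    """Estimated time to scan the index for a set of packages.

    release_counts: number of exposed releases for each package.
    latency_ms: assumed round-trip time per request, in milliseconds.
    """
    total_requests = sum(index_requests(n) for n in release_counts)
    return total_requests * latency_ms / 1000.0


if __name__ == "__main__":
    # Hypothetical workload: 30 packages, 10 exposed releases each.
    workload = [10] * 30
    print(sum(index_requests(n) for n in workload), "requests")
    print(scan_seconds(workload, 300), "seconds at 300 ms per request")
```

The model ignores off-site home page and download URLs, which add
further requests, so real numbers are likely to be somewhat worse.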
Another potential issue is that PyPI pages can be large. I've found
it convenient to use PyPI package pages as the home page for many of
my projects. I like to include package documentation in my project
pages. Perhaps this is an abuse of PyPI, but it is very convenient
for me and no one has complained. :) The zc.buildout pages are
around 200K. That's a fair bit of data for setuptools to download
and scan for download URLs.
In the course of this discussion, I've realized that it doesn't make
sense for setuptools to use the same interface that humans use.
setuptools doesn't need to see all of the data that is useful to
humans. Similarly, humans generally don't need to see all of the
historical releases for a project. I suggested a simple page format
designed just for setuptools. An alternative would be an xmlrpc
API. I prefer pages because I think that, over time, the number of
requests from automated tools like easy_install and zc.buildout will
increase substantially and will ultimately overwhelm dynamic
servers, even ones like PyPI that are reasonably fast. I also think
that a simple static collection of pages will be easier to mirror and
I think some number of geographic mirrors is likely to help some
people. I promised to prototype the format I suggested.
I've created an experimental prototype setuptools-specific package
index at
http://download.zope.org/ppix
Going to that page gives brief instructions for using it with
easy_install and zc.buildout. To see an individual package page, add
the package name to the URL, as in:
http://download.zope.org/ppix/setuptools/
A few things to note about this:
- I don't expose a long package list at http://download.zope.org/ppix/.
The long package list would be expensive to download and
supports a use case that I consider to be of negative value, which is
installing packages with case-insensitive package names. I think it
is important for humans to be able to search for packages using
case-insensitive search terms, but I think that, after identifying a
package, precise package names should be used. I think it is
especially important that precise package names be used in package
requirements.
- There is a single page per package. This can greatly reduce the
number of requests. Packages that store all of their distributions
in PyPI and that don't have off-site home pages or download URLs can
be scanned with a single request. Note that I excluded home page and
download URLs that pointed back to the package's PyPI page, as that
wouldn't provide any new information to setuptools.
- Download URLs for *hidden* packages are included. Humans don't
need to see old revisions, but setuptools-based tools do. If we used
an index like this for setuptools, we could stop unhiding old
releases when we created new releases in PyPI. This would make PyPI
more useful to humans and less of a pain for developers.
- Download URLs are the same as they are in PyPI. Using this new
index, distributions are still downloaded from PyPI, so the index
doesn't affect PyPI download statistics.
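To illustrate why a single small page per package is cheap for a tool
to scan, here is a minimal Python sketch that pulls download URLs and
their md5 fragments out of a page. The page layout is an assumption
(a plain HTML page of links); this message doesn't spell out the exact
format, and this is not how setuptools itself is implemented. The first
sample link is the zc.buildout egg URL that appears in the easy_install
output further down; the second is a made-up placeholder.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def download_links(page_html):
    """Return (url, md5 or None) pairs for each link on the page."""
    collector = LinkCollector()
    collector.feed(page_html)
    collector.close()
    result = []
    for href in collector.links:
        url, fragment = urldefrag(href)
        md5 = fragment[len("md5="):] if fragment.startswith("md5=") else None
        result.append((url, md5))
    return result


# Hypothetical page body; the real ppix markup may differ.
SAMPLE_PAGE = """
<html><body>
<a href="http://cheeseshop.python.org/packages/2.5/z/zc.buildout/zc.buildout-1.0.0b28-py2.5.egg#md5=4e37e53f010ed7984555a029732f479d">zc.buildout-1.0.0b28-py2.5.egg</a>
<a href="http://example.invalid/zc.buildout-1.0.0b27.tar.gz">zc.buildout-1.0.0b27.tar.gz</a>
</body></html>
"""

if __name__ == "__main__":
    for url, md5 in download_links(SAMPLE_PAGE):
        print(url, md5)
```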
To see the impact of this, it's interesting to look at installing
zc.buildout using easy_install from PyPI and from the experimental
index:
Installing using PyPI looks like this:
(env)jim@ds9:~/tmp$ time easy_install zc.buildout
Searching for zc.buildout
Reading http://cheeseshop.python.org/pypi/zc.buildout/
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b19
Reading http://svn.zope.org/zc.buildout
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b22
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b23
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b20
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b21
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b26
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b27
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b24
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b25
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b28
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b17
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b16
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b18
Best match: zc.buildout 1.0.0b28
Downloading http://cheeseshop.python.org/packages/2.5/z/zc.buildout/zc.buildout-1.0.0b28-py2.5.egg#md5=4e37e53f010ed7984555a029732f479d
Processing zc.buildout-1.0.0b28-py2.5.egg
creating /home/jim/tmp/env/lib/python2.5/zc.buildout-1.0.0b28-py2.5.egg
Extracting zc.buildout-1.0.0b28-py2.5.egg to /home/jim/tmp/env/lib/python2.5
Adding zc.buildout 1.0.0b28 to easy-install.pth file
Installing buildout script to /home/jim/tmp/env/bin/
Installed /home/jim/tmp/env/lib/python2.5/zc.buildout-1.0.0b28-py2.5.egg
Processing dependencies for zc.buildout
Searching for setuptools==0.6c6
Best match: setuptools 0.6c6
Processing setuptools-0.6c6-py2.5.egg
Adding setuptools 0.6c6 to easy-install.pth file
Installing easy_install script to /home/jim/tmp/env/bin/
Installing easy_install-2.5 script to /home/jim/tmp/env/bin/
Installed /home/jim/tmp/env/lib/python2.5/setuptools-0.6c6-py2.5.egg
Processing dependencies for setuptools==0.6c6
Finished processing dependencies for setuptools==0.6c6
Finished installing setuptools==0.6c6
Finished processing dependencies for zc.buildout
Finished installing zc.buildout
real 0m31.360s
user 0m1.136s
sys 0m0.060s
Note the large number of pages read. Here I was installing a single
package with one dependency, setuptools, that was already installed.
Let's look at this again using the experimental index:
(env)jim@ds9:~/tmp$ time easy_install -i http://download.zope.org/ppix zc.buildout
Searching for zc.buildout
Reading http://download.zope.org/ppix/zc.buildout/
Best match: zc.buildout 1.0.0b28
Downloading http://cheeseshop.python.org/packages/2.5/z/zc.buildout/zc.buildout-1.0.0b28-py2.5.egg#md5=4e37e53f010ed7984555a029732f479d
Processing zc.buildout-1.0.0b28-py2.5.egg
creating /home/jim/tmp/env/lib/python2.5/zc.buildout-1.0.0b28-py2.5.egg
Extracting zc.buildout-1.0.0b28-py2.5.egg to /home/jim/tmp/env/lib/python2.5
Adding zc.buildout 1.0.0b28 to easy-install.pth file
Installing buildout script to /home/jim/tmp/env/bin/
Installed /home/jim/tmp/env/lib/python2.5/zc.buildout-1.0.0b28-py2.5.egg
Processing dependencies for zc.buildout
Searching for setuptools==0.6c6
Best match: setuptools 0.6c6
Processing setuptools-0.6c6-py2.5.egg
Adding setuptools 0.6c6 to easy-install.pth file
Installing easy_install script to /home/jim/tmp/env/bin/
Installing easy_install-2.5 script to /home/jim/tmp/env/bin/
Installed /home/jim/tmp/env/lib/python2.5/setuptools-0.6c6-py2.5.egg
Processing dependencies for setuptools==0.6c6
Finished processing dependencies for setuptools==0.6c6
Finished installing setuptools==0.6c6
Finished processing dependencies for zc.buildout
Finished installing zc.buildout
real 0m7.006s
user 0m0.244s
sys 0m0.040s
Note:
- We made far fewer requests with the new index
- Most of the time in the second example was spent actually
downloading the buildout distribution. Most of the time in the first
example was spent reading the index.
- I used workingenv to create clean environments for each of the
examples above.
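For anyone who wants to repeat the comparison, here is a small Python
sketch that summarizes a captured run: it counts the "Reading" lines and
extracts the wall-clock time from the `time` output. The function names
are made up for illustration, and the sketch assumes the output has been
saved as text in the format shown above.

```python
import re

# Matches the "real" line printed by the shell's time builtin,
# e.g. "real 0m31.360s" (the separator may be a tab or spaces).
REAL_RE = re.compile(r"^real\s+(\d+)m([\d.]+)s$", re.MULTILINE)


def real_seconds(output):
    """Return the wall-clock time, in seconds, from captured output."""
    match = REAL_RE.search(output)
    if match is None:
        raise ValueError("no 'real' line found")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def pages_read(output):
    """Count the index pages easy_install reported reading."""
    return sum(1 for line in output.splitlines()
               if line.startswith("Reading "))


if __name__ == "__main__":
    # Timings copied from the two runs above.
    pypi_run = "real 0m31.360s\nuser 0m1.136s\nsys 0m0.060s\n"
    ppix_run = "real 0m7.006s\nuser 0m0.244s\nsys 0m0.040s\n"
    ratio = real_seconds(pypi_run) / real_seconds(ppix_run)
    print("speedup: %.1fx" % ratio)
```

Applied to the two runs above, this comes out at 15 pages read versus 1,
and a wall-clock ratio of about 4.5.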
WRT zc.buildout, refreshing a buildout with just ZODB installed in it
takes about 45 seconds for me using PyPI and about 5 seconds using
the experimental index.
Some of the speed improvement is due to the fact that the
experimental index is much closer to me (on the net) than PyPI. ATM,
requests to PyPI take *me* around 500 milliseconds, while requests to
the experimental index are taking between 100 and 300 milliseconds.
(I'm at home and this seems to be somewhat variable.) Most of the
speed improvements are from reducing the number of requests.
I'm polling PyPI once a minute to get and apply updates. Thanks to
the new XML-RPC method that Martin added, this is very efficient to do.
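The polling loop can be sketched roughly as follows in Python. This is
not my actual mirroring code. It assumes the XML-RPC method in question
is PyPI's `changelog(since)` call, that each entry it returns starts
with the package name, and that `index_url` and the `rebuild_page`
callback are supplied by the caller; all of those names are
illustrative assumptions.

```python
import time
import xmlrpc.client


def changed_packages(server, since):
    """Return the sorted names of packages changed since a timestamp.

    Assumption: server.changelog(since) returns a list of entries whose
    first element is the package name, e.g.
    (name, version, timestamp, action).
    """
    names = set()
    for entry in server.changelog(since):
        names.add(entry[0])
    return sorted(names)


def poll_forever(index_url, rebuild_page, interval=60):
    """Poll the index once per interval and rebuild changed pages.

    index_url: XML-RPC endpoint of the package index (caller supplied).
    rebuild_page: hypothetical callback that regenerates the static
    page for one package name.
    """
    server = xmlrpc.client.ServerProxy(index_url)
    last = int(time.time())
    while True:
        now = int(time.time())
        for name in changed_packages(server, last):
            rebuild_page(name)
        last = now
        time.sleep(interval)
```

Only the packages named in the change log are rebuilt on each pass, which
is what keeps a once-a-minute poll cheap.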
I encourage people to check this out and even try using it with
easy_install and especially buildout. AFAIK, aside from being much
faster and showing download files for hidden releases, it is
completely equivalent to PyPI for setuptools use. My intention is to
keep this experimental index going and up to date for the foreseeable
future, and I plan to use it for all my work.
My primary goal is to prototype the new index format. If this seems
useful, then I think that www.python.org should expose an index in
this format to setuptools, either at a different URL or by satisfying
setuptools requests from the index based on client information. I'd
love to see this index populated via a baking mechanism that updates
package pages when they change, rather than through polling as I'm
doing.
There would be some benefit to having geographic mirrors. I suspect
that having such mirrors available would improve performance further,
at least for some folks. It might also be useful to have some
mirrors for redundancy purposes. Note though that what I'm doing is
mirroring only the index data. I'm not mirroring distributions. Of
course, I'd be happy to make my software available. (It already is
via our subversion repository.)
I hope this effort spurs useful discussion and progress.
Jim
--
Jim Fulton mailto:jim at zope.com Python Powered!
CTO (540) 361-1714 http://www.python.org
Zope Corporation http://www.zope.com http://www.zope.org