[Python-Dev] setuptools: past, present, future
Phillip J. Eby
pje at telecommunity.com
Fri Apr 21 21:57:42 CEST 2006
I've noticed that there seems to be a lot of confusion out there about what
setuptools is and/or does, at least among Python-Dev folks, so I thought it
might be a good idea to give an overview of its structure, so that people
have a better idea of what is and isn't "magic".
Setuptools began as a fairly routine collection of distutils extensions, to
do the same boring things that everybody needs distutils extensions to
do. Basic stuff like installing data with your packages, running unit
tests, that sort of thing.
At some point, I was getting tired of having to deal with dependencies by
making people install them manually, or else having to bundle them. I
wanted a more automated way to deal with this problem, and in 2004 brought
the problem to the distutils-sig and planned to do a PyCon sprint to try to
address the problem. Tim Peters encouraged me to move the preliminary work
I'd done to the Python sandbox, where others could follow the work and
improve upon it, and he sponsored me for CVS privileges so I could do so.
As it turned out, I wasn't able to go to PyCon, but I produced some crude
stuff to try to implement dependency handling, based on some previous work
by Bob Ippolito. Bob's stuff used imports to check version strings, and
mine was a bit more sophisticated in that it could scan .py or .pyc files
without actually importing them. But there was no reasonable way to track
download URLs, or to deal with the myriad package formats (source,
RPM, etc.), platform specificity, and so on; and PyPI didn't really exist yet.
To top it all off, within a couple of months I was laid off, so the problem
ceased to be of immediate practical interest to me. I decided to
take a six-month sabbatical and work on RuleDispatch, after which I began
contracting for OSAF.
OSAF's Chandler application has a plugin platform akin to Eclipse, and I
saw that it was going to need a cross-platform plugin format. I put out
the call to distutils-sig, and Bob Ippolito took up the challenge. We
designed the first egg format, and we agreed that it should support Python
libraries, not just plugins, and that it should be possible to treat .egg
zipfiles and directories interchangeably, and that it should be possible to
put more than one conceptual egg into one physical zipfile. The true "egg"
was the project release, not the zipfile itself. (We called a zipfile
containing multiple eggs a "basket", which we thought would be useful for
things like py2exe. pkg_resources still supports baskets today, but there
are no tools for generating them - you have to just zip up a bunch of .egg
directories to make one.)
Bob wrote the prototype pkg_resources module to support accessing resources
in zipfiles and regular directories, while I worked on creating a bdist_egg
command, which I added to the then-dormant setuptools package, figuring
that the experimental dependency stuff could be later refactored to allow
dependencies to be resolved using eggs. We had a general notion that there
would be some kind of web pages you could use to list packages on, since at
that time PyPI didn't allow uploads yet. Or at any rate, we didn't know
about it until PyCon in 2005.
After PyCon, I kept hearing about projects to make a CPAN-like tool for
Python, such as the Uragas project. However, all of these projects sounded
like they were going to reinvent everything from scratch, particularly a
lot of stuff that Bob and I had just done. It then occurred to me for the
first time that the .egg format could be used to solve the problems both of
having a local package database, and also the uninstallation and upgrade of
packages. In fact, the only piece missing was that there was no way to
find and download the packages to be installed, and if I could solve that
problem, the CPAN problem would be solved.
So, I did some research by taking a random sample of packages from PyPI, to
find out what information people were actually registering. I found that,
more often than not, at least one of their PyPI URLs would point to a page
that had links to packages that could be downloaded directly. And that was
basically enough to permit writing a very simple spider that would only
follow "download" or "homepage" links from PyPI pages, and would also
inspect URLs to see if they were recognizable as distutils-generated
filenames, from which it could extract package name and version info.
Thus, easy_install was born, completing what some people now call the
eggs/setuptools/easy_install trifecta.
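The filename heuristic described above can be sketched in a few lines.
This is only an illustration of the idea, not easy_install's actual code;
the real heuristics handle many more cases (eggs, bdists, "-py2.4" tags,
and so on):

```python
import re

# Pattern for distutils-generated sdist filenames such as
# "FooBar-1.2.3.tar.gz". Illustrative only: the real easy_install
# recognizes many more variants and extensions.
DIST_RE = re.compile(
    r'^(?P<name>[A-Za-z0-9_.]+)-(?P<version>\d[\w.]*)'
    r'\.(?:tar\.gz|tgz|zip)$'
)

def parse_download_url(url):
    """Extract (project, version) from a download URL, or return None
    if the last path component doesn't look distutils-generated."""
    filename = url.rstrip('/').rsplit('/', 1)[-1]
    m = DIST_RE.match(filename)
    if m:
        return m.group('name'), m.group('version')
    return None
```

For example, parse_download_url("http://example.com/dist/FooBar-1.2.3.tar.gz")
yields ("FooBar", "1.2.3"), while a plain HTML page URL yields None.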
If you are going to work on or support these tools, it's important that you
understand that these three things are related, but distinct. Setuptools
is at heart just an ordinary collection of distutils enhancements that
happens to include a bdist_egg command. EasyInstall is another
enhanced command built on setuptools, one that leverages setuptools to build
eggs for packages that don't have them. And setuptools in turn depends
on EasyInstall, so that packages can have dependencies.
So the components are:
pkg_resources: standalone module for working with project releases,
dependency specification and resolution, and bundled resources
setuptools: a package of distutils extensions, including ones to build
eggs with
easy_install: a distutils extension built using setuptools, that
finds, downloads, builds eggs for, and installs packages that use either
distutils or setuptools
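To make the pkg_resources role concrete, here is a toy version of the kind
of bookkeeping it does when resolving a dependency: checking a requirement
string against a table of installed project versions. This is a sketch of
the concept only; the real pkg_resources parses far richer specifiers and
discovers installed projects itself rather than taking a table:

```python
import re

def check_requirement(installed, requirement):
    # Toy dependency check: `installed` maps project name -> version
    # string, and `requirement` looks like "ctypes>=0.9". Only ">=" and
    # "==" with purely numeric dotted versions are handled here.
    m = re.match(r'^([\w.\-]+)\s*(>=|==)\s*([\d.]+)$', requirement)
    if not m:
        raise ValueError("unsupported requirement: %r" % (requirement,))
    name, op, wanted = m.groups()
    if name not in installed:
        return False
    # Compare versions as tuples of integers, e.g. (0, 9, 6) >= (0, 9).
    have = tuple(int(p) for p in installed[name].split('.'))
    want = tuple(int(p) for p in wanted.split('.'))
    return have >= want if op == '>=' else have == want
```

So with {"ctypes": "0.9.6"} installed, "ctypes>=0.9" is satisfied but
"ctypes>=1.0" is not.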
And if you look at that list, it's pretty easy to see which part is the
most magical, implicit, heuristic, etc. It's easy_install, no
question. If it weren't for the fact that easy_install tries to support
non-setuptools packages, there would be little need for monkeypatching or
sandboxing. If it weren't for the fact that easy_install tries to
interpret web pages, there would be no need for heuristics or guessing.
So, in a perfect world where everybody neatly files everything with PyPI,
easy_install would not have anything implicit about it. But this isn't a
perfect world, and to gain adoption, it had to have backward
compatibility. If easy_install could handle *enough* existing packages,
then it would encourage package authors to use it so that they could depend
on those existing packages. These authors would end up using setuptools,
which would then tend to ensure that *their* package would be
easy_install-able as well.
And, since the user needs setuptools to install these new packages, then
the user now has setuptools, and the option to try using it to install
other packages. Users then encourage package authors to have correct PyPI
information so their packages can be easy_install-ed as well, and the
network effect increases from there.
So, I bundled all three things (pkg_resources, setuptools, and
easy_install) into a single distribution precisely so that it would have
this "viral" network effect. I knew that if everybody had to be made to
get their PyPI entries straight *first*, it would never work. But if I
could leverage an ever-growing user population to put pressure on authors
and system packagers, and an ever-growing author population to increase the
number of users, then the natural course of things should be that packages
that don't play will die off, be forked, etc., and those who do play will
be rewarded with more users.
I made an explicit, conscious, and cold-blooded decision to do things that
way, knowing full well that it would immediately kill off all the competing
"CPAN for Python" projects, and that it would also force lots of people to
deal with setuptools who didn't care about it one way or another. The
community as a whole would benefit immensely, even if the costs would be
borne by people who didn't agree with what I was doing.
So, yes, I'm a cold calculating bastard. EasyInstall is #1 in the field
because it was designed to make its competition irrelevant and to virally
spread itself across the entire Python ecosphere. I'm pointing these
things out now because I think it's better not to mince words; easy_install
was designed with Total World Domination in mind from day one and that is
exactly what it's here to do. Compatibility at any cost is its watchword,
because that is what fuels its adoption. End-users are its market, because
what the end users want ultimately controls what the developers and the
packagers do.
Thus, if you look at the history of setuptools, you'll see that the vast
majority of work I do on it is increasing the Just-Works-iness of
easy_install. The majority of changes to non-easy_install code (and both
setuptools.package_index and setuptools.sandbox exist only for
easy_install) are architectural or format changes intended to support
greater Just-Works-iness for easy_install.
(There are also lots of changes included to enhance setuptools' usefulness
as a distutils extension, but these are driven mainly by user requests and
Chandler needs, and there aren't nearly as many such changes.)
So, if you take easy_install and its support modules entirely out of
setuptools, you would be left with a modest assortment of distutils
extensions, most of which don't have any backward compatibility
issues. They could be merged into the distutils with nary a
complaint. The only significant change is the "sdist" command, which in
setuptools supports a cleaner (and extensible) way of managing the source
distribution manifest, freeing developers from messing with the MANIFEST
file and from remembering to constantly add junk to MANIFEST.in. And there's
probably some way we could either keep the old behavior or make
the old behavior an option for anybody who's relying on the way it worked
before.
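For context, this is the kind of MANIFEST.in boilerplate that the
setuptools sdist command makes largely unnecessary (the file patterns
here are hypothetical):

```
include README.txt CHANGES.txt
recursive-include docs *.txt
recursive-include examples *.py
```

Under setuptools, files under CVS or Subversion control are picked up
automatically when building an sdist, so most projects need little or
none of this by hand.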
And that's all well and good, but now you don't have the features that are
the real reason end users want the whole thing: easy_install.
And it's not just the users. Package authors want it too. TurboGears
really couldn't exist without this. It's easy to argue that oh, they
could've made distribution packages for six formats and nine platforms, or
they could've made tarballs, etc. to bundle all the dependencies in, but
those approaches really just don't scale -- especially for the single
package author just starting to build something new.
None of these options are economically viable for the author of a new
package, especially if their core competency isn't packaging and
distribution. Now that there's a TurboGears community, yes, there are
probably people available who can do a lot of those distribution-related
tasks. But there wouldn't have *been* a community if Kevin couldn't have
shipped the software by himself!
This is the *real* problem that I always meant to address, from the very
beginning: Python development and distribution *costs too much* for the
community to flourish as it should. It's too hard for non-experts, and
until now it required bundling, system packaging, or asking users to
install their own dependencies. But asking users to install dependencies
doesn't scale for large numbers of dependencies. And not being able to
reuse packages leads to proliferating wheel-reinvention, because
installation cost is a barrier to entry.
So, the work that I've done is simply social engineering through economic
leverage. The goal is to change the cost equations so that entry barriers
for package distribution are low, so that users can try different packages,
so they can switch, so market forces can choose winners. Because switching
and installation costs are low, interoperability and reuse are more
attractive choices, and more likely to be demanded by users. You can
already see these forces taking effect in such developments as the joint
CherryPy/TurboGears template plugin interface, which uses another
setuptools innovation (entry points) to allow plug-and-play.
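Since entry points come up here, a toy model of the mechanism may help:
packages advertise named factories under a group, and consumers look the
group up at runtime, getting plug-and-play behavior without hard-coding
any plugin imports. This simulates the idea with a plain dictionary; it
is not the pkg_resources API, and the group and plugin names are invented:

```python
# Toy model of the entry-point mechanism. Real code declares entry
# points in setup.py (entry_points={...}) and discovers them via
# pkg_resources; here a module-level dict stands in for that machinery.

REGISTRY = {}  # group name -> {entry point name: callable}

def register(group, name):
    """Decorator standing in for an entry_points declaration."""
    def deco(fn):
        REGISTRY.setdefault(group, {})[name] = fn
        return fn
    return deco

@register("template.engines", "upper")
def upper_engine(text):
    """A trivial 'template engine' plugin."""
    return text.upper()

def render(engine_name, text):
    """Consumer side: look up a plugin by group and name at runtime."""
    return REGISTRY["template.engines"][engine_name](text)
```

With this in place, render("upper", "hello") returns "HELLO", and a new
engine can be added by any module that calls register(), without the
consumer changing at all.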
I am doing all this because I got tired of reinventing wheels. When you
add in installation costs, writing your own package looks more attractive
than reusing the other guy's. But if installation is cheap, then people
are more inclined to overlook the minor differences between how the other
guy did it and how they would have done it, and are more likely to say to
the "other guy", hey, I like this but would you add X? And it's more
likely that the "other guy" will say yes, because it will *multiply* his
install base to get another published package depending on his project.
So, my question to all of you is, is that worth a little implicitness, a
little magic? My answer, of course, is yes. It will probably be a
multi-year effort to get the state of community practice up to a level
where all the heuristics and webscraping can be removed from easy_install,
without negatively affecting the cost equation.
Or maybe not. Maybe we're just hitting the turn of the hockey stick now,
and inclusion in 2.5 is just what the doctor ordered to kick the number of
users so high that anybody would be crazy not to have clean PyPI listings,
I don't know. To be honest, though, I think the outstanding proposal on
Catalog-SIG to merge Grig's "CheeseCake" rating system into PyPI (so that
package authors will be shown what they can do to improve their listing
quality) will actually have more direct impact on this than 2.5 inclusion
will. Guido's choice to bless setuptools is important for system packagers
and developers to have confidence that this is the direction Python is
taking; it doesn't have to actually go *in* 2.5 to do
that. install_egg_info clearly shows the direction we're taking.
So, after reading all the other stuff that's gone by in the last few days,
this is what I think should happen:
First, setuptools should be withdrawn from inclusion in Python 2.5. Not
directly because of the opposition, but because of the simple truth that
it's just not ready. Some of that is because I've spent way too much time
on the discussions this week, to the point of significant sleep deprivation
at one point. But when Guido first asked about it, I had concerns about
getting everything done that really needed to be done, and effectively only
agreed because I figured out a way to allow new versions to be distributed
after-the-fact. With the latest Python 2.5 release schedule, I'd be
hard-pressed to get 0.7 to stability before the 2.5 betas go, certainly if
I'm the only one working on it.
And a stable version of 0.7 is really the minimum that should go in the
standard library, because the package management side of things really
needs to have commands to list, uninstall, upgrade, etc., and they need to
be easy to understand, not the confusing mishmash that is easy_install's
current assortment of options. (Which grew organically, rather than being
designed ahead of time.)
And Fredrik is right to bring up concerns about both easy_install's
confusing array of options, and the general support issues of asking
Python-Dev to adopt setuptools. These are things that can be addressed,
and *are* being addressed, but they're not going to happen by Tuesday, when
the alpha release is scheduled.
I hate to say this, because I really don't want to disappoint Guido or
anyone on Python-Dev or elsewhere who has been calling for it to go in. I
really appreciate all your support, but Fredrik is right, and I can't let
my desire to please all of you get in the way of what's right.
What *should* happen now instead, is a plan for merging setuptools into the
distutils for 2.6. That includes making the decisions about what "install"
and "sdist" should do, and whether backward compatibility of internal
behaviors should be implicit or explicit. I don't want to start *that*
thread right now, and we've already heard plenty of arguments on both
sides. Indeed, since Martin and Marc seem to be diametrically opposed on
that issue, it is guaranteed that *somebody* will be unhappy with whatever
decision is made. :)
Between 2.5 and 2.6, setuptools should continue to be developed in the
sandbox, and keep the name 'setuptools'. For 2.6, however, we should merge
the code bases and have setuptools just be an alias. Or, perhaps what is
now called setuptools should be called "distutils2" and distributed as
such, with "setuptools" only being a legacy name. But regardless, the plan
should be to have only one codebase for 2.6, and to issue backported
releases of that codebase for at least Python 2.4 and 2.5.
These ideas are new for me, because I hadn't thought that anybody would
have cared enough to want to get into the code and share any of the
work. That being the case, it seems to make more sense for me to back off
a little on development in order to work on developer documentation
of the kind Fredrik has been asking for, and to work on a
development roadmap so we can co-ordinate who will work on what, when, to
get 0.7 to stability.
In the meantime, Python 2.5 *does* have install_egg_info, and it should
definitely not be pulled out. install_egg_info ensures that every package
installed by the distutils is detectable by setuptools, and thus will not
be reinstalled just because it wasn't installed by setuptools.
And there is one other thing that should go into 2.5, and that is PKG-INFO
files for each bundled package that we are including in the standard
library, that is distributed separately for older Python versions and is
API-compatible. So for example, if ctypes 0.9.6 is going into Python 2.5, it
should have a PKG-INFO in the appropriate directory to say so. Thus,
programs written for Python 2.4 that say they depend on something like
"ctypes>=0.9" will work with Python 2.5 without needing to change their
setup scripts to remove the dependency when the script is run under Python 2.5.
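A minimal PKG-INFO of the sort described might look like this (fields per
the distutils metadata format; in practice the file would be generated
from the package's own setup script, and the Summary line here is just a
placeholder):

```
Metadata-Version: 1.0
Name: ctypes
Version: 0.9.6
Summary: A foreign function interface package for Python
```

With that file present in the install directory, a requirement like
"ctypes>=0.9" resolves against the bundled copy instead of triggering a
download.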
Last, but not least, we need to find an appropriate spot to add
documentation for install_egg_info.
These are tasks that can be accomplished for 2.5, they are reasonably
noncontroversial, and they do not add any new support requirements or
stability issues that I can think of.
One final item that is a possibility: we could leave pkg_resources in for
2.5, and add its documentation. This would allow people to begin using its
API to check for installed packages, access resources, etc. I'd be
interested in hearing folks' opinions about that, one way or the other.