hello again,

On Tue, Dec 29, 2009 at 2:22 PM, David Cournapeau <cournape@gmail.com> wrote:
On Tue, Dec 29, 2009 at 10:27 PM, René Dudfield <renesd@gmail.com> wrote:
Hi,
In the toydist proposal/release notes, I would address 'what does toydist do better' more explicitly.
**** A big problem for science users is that numpy does not work with pypi + (easy_install, buildout or pip) and python 2.6. ****
Working with the rest of the python community as much as possible is likely a good goal.
Yes, but it is hopeless. Most of what is being discussed on distutils-sig is useless for us, and what matters is ignored at best. I think most people on distutils-sig are misguided, and I don't think the community is representative of people concerned with packaging anyway - most of the participants seem to be around web development, and are mostly dismissive of others' concerns (OS packagers, etc...).
Sitting down with Tarek (who is one of the current distutils maintainers) in Berlin, we had a little discussion about packaging over pizza and beer... and he was quite mindful of OS packagers' problems and issues. He was also interested to hear about game developers' issues with packaging (which are different again from scientific users'... but similar in many ways). However, these systems were developed by the zope/plone/web crowd, so they are naturally going to be thinking a lot about zope/plone/web issues.

Debian and ubuntu packages are mostly useless to them because of their age. Waiting a couple of years for your package to be released is just not an option (waiting even an hour for bug fixes is sometimes not an option). Also, isolation of packages is needed for machines that have 100s of different applications running, written by different people, each with dozens of packages used by each application. Tools like checkinstall and stdeb ( http://pypi.python.org/pypi/stdeb/ ) can help with older style packaging systems like deb/rpm. I think perhaps if toydist included something like stdeb, not as an extension to distutils but as a standalone tool (like toydist itself), there would be fewer problems with it.

One thing the various zope related communities do is make sure all the relevant and needed packages are built/tested by their compile farms. This makes pypi work for them a lot better than a non-coordinated effort does. There are also lots of people trying out new versions all of the time.
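For anyone who hasn't tried it, stdeb plugs into an existing setup.py as extra distutils commands - something like this (command names are from the stdeb docs; exact usage may vary between versions):

    # build a Debian source package from a regular setup.py
    python setup.py --command-packages=stdeb.command sdist_dsc

    # or go straight to a .deb
    python setup.py --command-packages=stdeb.command bdist_deb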
I want to note that I am not starting this out of thin air - I know most of distutils code very well, I have been the mostly sole maintainer of numpy.distutils for 2 years now. I have written extensive distutils extensions, in particular numscons which is able to fully build numpy, scipy and matplotlib on every platform that matters.
Simply put, distutils code is horrible (this is an objective fact) and flawed beyond repair (this is more controversial). IMHO, it has almost no useful feature, except being standard.
yes, I have also battled with distutils over the years. However, it is simpler than autotools (for me... maybe distutils has perverted my fragile mind), and works on more platforms for python than any other current system. It is much worse for C/C++ modules though. It needs dependency and configuration tools to work better (like the ones many C/C++ projects hack into distutils themselves). Monkey patching and extensions are especially a problem... as is the horrible code quality of distutils by modern standards. However, distutils has gained more tests and testing infrastructure, so refactoring/cleaning it up is becoming more feasible.
If you want a more detailed explanation of why I think distutils and all tools on top are deeply flawed, you can look here:
http://cournape.wordpress.com/2009/04/01/python-packaging-a-few-observations...
I agree with many things in that post, except your conclusion on multiple versions of packages in isolation. Package isolation is like processes, and package sharing is like threads - and threads are evil! Leave my python site-packages directory alone I say... and especially don't let setuptools infect it :) Many people currently find that the multiple-versions-in-isolation approach works well for them - so for some use cases the tools are working wonderfully.
numpy used to work with buildout in python2.5, but not with 2.6. buildout lets other team members get up to speed with a project by running one command. It installs things in the local directory, not system wide. So you can have different dependencies per project.
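For those who haven't used it, a minimal buildout.cfg looks something like this (the recipe is the standard zc.recipe.egg; the egg list is just an example):

    [buildout]
    parts = python

    [python]
    recipe = zc.recipe.egg
    interpreter = python
    eggs =
        numpy
        scipy

Running bin/buildout then gives you a bin/python with those eggs available, all installed under the project directory rather than system wide.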
I don't think it is a very useful feature, honestly. It seems to me that they created a huge infrastructure to split packages into tiny pieces, and then try to get them back together, imagining that multiple installed versions is a replacement for backward compatibility. Anyone with extensive packaging experience knows that's a deeply flawed model in general.
Science is supposed to allow repeatability. Without the same versions of packages, repeating experiments is harder. This is a big problem in science, and multiple versions of packages in _isolation_ can help solve the repeatability problem. Just pick some random paper and try to reproduce their results: it's generally very hard, unless the software is quite well packaged. Especially for graphics related papers there are often many different types of environments, so setting up the environments to try out their techniques and verify results quickly is difficult. Multiple versions are not a replacement for backwards compatibility, just a way to avoid the problem in the short term so you are not blocked. If a new package version breaks your app, then you can either pin it to an old version, fix your app, or fix the package. It is also not a replacement for building on stable, high quality components, but it helps you work with less stable and lower quality components - at a much faster rate of change, and with a much larger dependency list.
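Pinning is cheap in these tools. In buildout, for example, you can pin every dependency of a project with a [versions] section (the version number below is only an illustration):

    [buildout]
    parts = python
    versions = versions

    [python]
    recipe = zc.recipe.egg
    interpreter = python
    eggs = numpy

    [versions]
    numpy = 1.3.0

That makes an environment reproducible enough to rerun an experiment months later, without demanding backwards compatibility from every upstream package.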
Plenty of good work is going on with python packaging.
That's the opposite of my experience. What I care about is:
- tools which are hackable and easily extensible
- robust install/uninstall
- a real, DAG-based build system
- explicitness and repeatability
None of this is supported by the tools, and the current directions go even further away. When I have to explain at length why the command-based design of distutils is a nightmare to work with, I don't feel very confident that the current maintainers are aware of the issues, for example. It shows that they never had to extend distutils much.
All agreed! I'd add to the list parallel builds/tests (make -j 16), and outputting to native build systems, eg xcode and msvc projects, and makefiles.

It would be interesting to know your thoughts on buildout recipes (see creating recipes http://www.buildout.org/docs/recipe.html ). They seem to work better from my perspective. However, that is probably because of isolation: the recipes are only used by those projects that require them, so the chance of them interacting is lower, as they are not installed in the main python. How will you handle toydist extensions so that multiple extensions do not have problems with each other? I don't think this is possible without isolation, and even then it's still a problem.

Note, the section in the distutils docs on creating command extensions is only around three paragraphs. There is also no central place to go looking for extra commands (that I know of), or a place to document and share each other's command extensions. Many of the methods for extending distutils are not very well documented either. For example, 'how do you change compiler command line arguments for certain source files?' Basic things like that are possible with distutils, but not documented (very well) - see the sketch below for the kind of trick required.
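To illustrate, here is roughly the kind of undocumented hack needed for per-source compiler flags - a sketch only (file names and flags are made up), and it relies on the private _compile hook of the unix compiler classes (MSVC overrides compile() directly, so even this doesn't work everywhere):

    from distutils.core import setup, Extension
    from distutils.command.build_ext import build_ext

    # extra flags for particular source files - examples only
    EXTRA_FLAGS = {'src/fast_math.c': ['-O3', '-ffast-math']}

    class build_ext_perfile(build_ext):
        def build_extensions(self):
            compiler = self.compiler
            original_compile = compiler._compile

            def compile_with_flags(obj, src, ext, cc_args,
                                    extra_postargs, pp_opts):
                flags = extra_postargs + EXTRA_FLAGS.get(src, [])
                original_compile(obj, src, ext, cc_args, flags, pp_opts)

            # monkey patching a private method - exactly the kind of
            # fragility being complained about above
            compiler._compile = compile_with_flags
            build_ext.build_extensions(self)

    setup(name='example',
          ext_modules=[Extension('fast_math', ['src/fast_math.c'])],
          cmdclass={'build_ext': build_ext_perfile})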
There are build farms that build windows and OSX packages and upload them to pypi. Start uploading pre-releases to pypi, and you get these for free (once you make numpy compile out of the box on those compile farms). There are compile farms for other OSes too... like ubuntu/debian, macports, etc. Some distributions even automatically download, compile and package new releases once they spot a new file on your ftp/web site.
I am familiar with some of those systems (PPA and opensuse build service in particular). One of the goals of my proposal is to make it easier to interoperate with those tools.
yeah, cool.
I think Pypi is mostly useless. The lack of enforced metadata is a big no-no IMHO. The fact that Pypi is miles behind CRAN, for example, is quite significant. I want CRAN for scientific python, and I don't see Pypi becoming it in the near future.
The point of having our own Pypi-like server is that we could do the following:
- enforcing metadata
- making it easy to extend the service to support our needs
Yeah, cool. Many other projects have their own servers too - pygame.org, plone, etc etc - which meet their own needs. Patches are accepted for pypi btw. What kinds of metadata enforcement do you mean, and how would they help? I imagine this could be done in a number of ways with pypi:
- a distutils command extension that people could use (sketched below),
- changing the pypi source code,
- checking the metadata for certain packages, then emailing their authors about the issues.
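A sketch of the first option (the required-field list here is just a guess at what you would enforce):

    from distutils.core import Command

    class check_metadata(Command):
        description = 'fail if required metadata fields are missing'
        user_options = []

        def initialize_options(self):
            pass

        def finalize_options(self):
            pass

        def run(self):
            meta = self.distribution.metadata
            required = ['name', 'version', 'author', 'url',
                        'license', 'description']
            missing = [f for f in required if not getattr(meta, f, None)]
            if missing:
                raise SystemExit('missing metadata: ' + ', '.join(missing))

Hook it up with cmdclass={'check_metadata': check_metadata} in setup(), and a server could refuse uploads from clients that don't pass a check like this.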
It is interesting to note that one of the maintainers of pypm recently quit the discussion about Pypi, most likely out of frustration with the other participants.
yeah, big mailing list discussions hardly ever help I think :) oops, this is turning into one.
Documentation projects are being worked on to document, give tutorials on, and generally make python packaging easier all round. As witnessed by the 20 or so releases on pypi every day (and growing), lots of people are using the python packaging tools successfully.
This does not mean much IMO. Uploading to Pypi is almost required to use virtualenv, buildout, etc. An interesting metric is not how many packages are uploaded, but how much pypi is used outside of developers.
Yeah, it only means that there are lots of developers able to use the packaging system to put their own packages up there. However, there are over 500 science related packages on there now - which is pretty cool. A way to measure whether packages are actually used would be by downloads, and by which packages depend on which other packages (see the script sketch below). I think the science ones would be reused less than normal, since a much higher percentage are C/C++ based, and are likely to be more fragile packages.
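Download counts are already queryable through pypi's XML-RPC interface, so a rough reuse metric is scriptable - method names below are from the pypi XML-RPC docs, though I haven't checked how reliable the numbers are:

    import xmlrpclib

    client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
    for version in client.package_releases('numpy', True):
        total = sum(f['downloads']
                    for f in client.release_urls('numpy', version))
        print version, total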
I'm not sure making a separate build tool is a good idea. I think going with the rest of the python community, and improving the tools there is a better idea.
It has been tried, and IMHO has been proved to have failed. You can look at the recent discussion (the one started by Guido in particular).
I don't think 500+ science related packages is a total failure really.
pps. some notes on toydist itself.
- toydist convert is cool for people converting a setup.py. This means that most people can try out toydist right away. But what does it gain these people who convert their setup.py files?
Not much ATM, except that it is easier to write a toysetup.info compared to a setup.py IMO, and that it supports a simple way to include data files (something which is currently *impossible* to do without writing your own distutils extensions). It also has the ability to build eggs without using setuptools (I consider not using setuptools a feature, given the too many failure modes of that package).
yeah, I avoid setuptools in my packages by default, but use command line arguments to enable the setuptools features I need (eggs, bdist_mpkg, etc etc). Having a tool to create eggs without setuptools would be great in itself. Definitely list this in the feature list :)
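The pattern I use is something like this in setup.py (the command names are just the ones I happen to need):

    import sys

    # only import setuptools when a setuptools-only command is requested;
    # plain distutils otherwise, so site-packages stays clean
    if set(sys.argv) & set(['bdist_egg', 'bdist_mpkg', 'develop']):
        from setuptools import setup
    else:
        from distutils.core import setup

    setup(name='example', version='0.1', py_modules=['example'])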
The main goals though are to make it easier to build your own tools on top of it, and to integrate with real build systems.
yeah, cool.
- a toydist convert that generates a setup.py file might be cool :)
toydist started like this, actually: you would write a setup.py file which loads the package description from toysetup.info and converts it to a dict argument for distutils.core.setup. I have not updated it recently, but that's definitely on the TODO list for a first alpha, as it would enable people to benefit from the format with 100% backward compatibility with distutils.
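The bridge would look something like this - every toydist name here is a guess at the eventual API, not something that exists today:

    from distutils.core import setup
    from toydist.core import PackageDescription  # assumed module path

    # parse the static description and hand it to plain distutils
    pkg = PackageDescription.from_file('toysetup.info')  # assumed helper
    setup(**pkg.to_distutils_dict())                     # assumed helper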
yeah, cool. That would let you develop things incrementally too, and still have toydist be useful for the whole development period until it catches up with the distutils features people need.
- arbitrary code execution happens when building or testing with toydist.
You are right for testing, but wrong for building. As long as the build is entirely driven by toysetup.info, you only have to trust toydist (which is not safe ATM, but that's an implementation detail), and your build tools of course.
If you execute build tools on arbitrary code, then arbitrary code execution is easy for someone who wants to do bad things. Trust, and secondarily sandboxing, are the best ways to solve these problems imho.
Obviously, if you have a package which uses an external build tool on top of toysetup.info (as will be required for numpy itself for example), all bets are off. But I think that's a tiny fraction of the interesting packages for scientific computing.
yeah, currently 1/5th of science packages use C/C++/fortran/cython etc (see http://pypi.python.org/pypi?:action=browse&c=40 110/458 on that page). There seem to be a lot more packages using C/C++ compared to other types on there (eg, zope3 packages list 0 out of 900 packages using C/C++). So the high number of C/C++ science related packages on pypi demonstrates that better C/C++ tools are a big need for scientific packages, especially compile/testing farms for all these packages.

Compile farms are a much bigger need here than for pure python packages, since C/C++ is MUCH harder to write/test in a portable way. I would say it is close to impossible to get code working without errors unless you have quite good knowledge of multiple platforms. There are many times with pygame development that I make changes on an osx, windows or linux box, commit the change, then wait for the compile/tests to run on the build farm ( http://thorbrian.com/pygame/builds.php ). Releasing packages otherwise makes the process *heaps* longer... and many times I still get errors on different platforms, despite many years of multi platform coding.
Sandboxing is particularly an issue on windows - I don't know a good solution for windows sandboxing, outside of full vms, which are heavyweight.
yeah, VMs are the way to go, if only to make each build a fresh install. However, I think automated distributed building and trust are more useful - ie, only build those packages whose authors you trust, and let anyone download, build, and then post their build/test results. MS have given out copies of windows to different members of the python community in the past to set up VMs for building.

By automated distributed building, I mean what happens on mailing lists usually, where people post their test results when they have a problem - except in a more automated manner. Adding a 'Do you want to upload your build/test results?' prompt at the end of a setup.py for subversion builds would give you dozens or hundreds of test results daily from all sorts of machines. Making it easy for people to set up package builders which also upload their packages somewhere gives you distributed package building in a fairly safe, automated manner. (more details here: http://renesd.blogspot.com/2009/09/python-build-bots-down-maybe-they-need.ht... )
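The setup.py hook could be as small as this - the collection URL is made up, and you would want the prompt to default to 'no':

    import platform, urllib, urllib2

    def offer_result_upload(log_text):
        answer = raw_input('Upload your build/test results? [y/N] ')
        if answer.lower() != 'y':
            return
        data = urllib.urlencode({
            'platform': platform.platform(),
            'python': platform.python_version(),
            'log': log_text,
        })
        # hypothetical collection endpoint
        urllib2.urlopen('http://example.org/builds/submit', data)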
- it should be possible to build this toydist functionality as a distutils/distribute/buildout extension.
No, it cannot, at least as far as distutils/distribute are concerned (I know nothing about buildout). Extending distutils is horrible, and fragile in general. Even autotools with its mix of generated sh scripts through m4 and perl is a breeze compared to distutils.
- extending toydist? How are extensions made? There are 175 buildout packages which extend buildout, and many that extend distutils/setuptools - so extension of build tools is a necessary thing.
See my answer earlier about interoperation with build tools.
I'm still not clear on how toydist will be extended, but I am a lot clearer about its goals now.
cheers,