On Wednesday 30 December 2009 06:15:45 René Dudfield wrote:
I agree with many things in that post. Except your conclusion on multiple versions of packages in isolation. Package isolation is like processes, and package sharing is like threads - and threads are evil!
You have stated this several times, but is there any evidence that this is the desire of the majority of users? In the scientific community, interactive experimentation is critical and users are typically not seasoned systems administrators. For such users, almost all packages installed after installing python itself are packages they use. In particular, all I want to do is to use apt/yum to get the packages (or ask my sysadmin, who rightfully has no interest in learning the intricacies of python package installation, to do so) and continue with my work. "Packages-in-isolation" is for people whose job is to run server farms, not interactive experimenters.
Leave my python site-packages directory alone I say... especially don't let setuptools infect it :) Many people currently find that the multiple-versions-in-isolation approach works well for them - so for some use cases the tools are working wonderfully.
More power to them. But for the rest of us, that approach is too much hassle.
Science is supposed to allow repeatability. Without the same versions of packages, repeating experiments is harder.
Really? IME, this is not the case. Simulations in signal processing are typically run with two different kinds of data sets:
- random data for Monte Carlo simulations
- well-known and widely available test streams
With both kinds of data sets, reimplementation of the same algorithms is rarely, if ever, affected by the versions of packages, primarily because of the wide variety of tool sets (and even more versions) that are in use.
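For what it's worth, here is a minimal sketch (plain numpy, with made-up parameters) of the kind of Monte Carlo run I mean: estimating the bit error rate of BPSK over an AWGN channel. Any reimplementation, in any language or tool set, should reproduce these numbers to within Monte Carlo noise, regardless of which package versions happen to be installed:

    import numpy as np

    def bpsk_ber(snr_db, n_bits=100000, seed=42):
        rng = np.random.RandomState(seed)      # fixed seed for repeatability
        bits = rng.randint(0, 2, n_bits)
        symbols = 2 * bits - 1                 # map {0,1} -> {-1,+1}
        noise_std = np.sqrt(0.5 / 10 ** (snr_db / 10.0))
        received = symbols + noise_std * rng.randn(n_bits)
        decided = (received > 0).astype(int)   # hard decision
        return np.mean(decided != bits)

    for snr in (0, 2, 4, 6, 8):
        print("SNR %d dB: BER %.4f" % (snr, bpsk_ber(snr)))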
Repeatability is a big problem in science, and multiple versions of packages in _isolation_ can help get us to a solution to it.
Package versions are, at worst, a very minor distraction in solving the repeatability problem. Usually, the main issues are unclear descriptions of the algorithms and unstated assumptions.
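And when the versions do matter, documenting them takes a few lines of code anyway; a sketch using pkg_resources (which ships with setuptools):

    import pkg_resources

    # Dump every installed distribution as an exact-version requirement.
    for dist in sorted(pkg_resources.working_set,
                       key=lambda d: d.project_name.lower()):
        print("%s==%s" % (dist.project_name, dist.version))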
Just pick some random paper and try to reproduce their results. It's generally very hard, unless the software is quite well packaged.
In scientific experimentation, it is folly to rely on software from the author of some random paper. In signal processing, almost every critical algorithm is re-implemented, and usually in a different language. The only exceptions are when the software can be validated with a large amount of test data, but this is very rare. Usually, you use some package to get started in your current environment. If it works (i.e., results meet your quality metric), you then build on it. If it does not work (even if only due to version incompatibility), you usually jettison it and either find an alternative or rewrite it.
Multiple versions are not a replacement for backwards compatibility, just a way to sidestep the problem in the short term so you are not blocked. If a new package version breaks your app, then you can either pin it to an old version, fix your app, or fix the package. They are also not a replacement for building on stable, high quality components, but they help you work with less stable, lower quality components - at a much faster rate of change, and with a much larger dependency list.
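For concreteness: "pin it to an old version" typically just means declaring an exact-version requirement. A minimal setuptools sketch, with made-up package names:

    from setuptools import setup

    setup(
        name="myapp",                           # hypothetical application
        version="0.1",
        install_requires=["SomePackage==1.2"],  # pin to the known-good release
    )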
This is a software engineer + systems administrator solution. In larger institutions, this is absolutely unworkable if you rely on IT for package management/installation.
Plenty of good work is going on with python packaging.
That's the opposite of my experience. What I care about is:
- tools which are hackable and easily extensible
- robust install/uninstall
- a real, DAG-based build system
- explicitness and repeatability
None of this is supported by the tools, and the current directions move even further away from it. When I have to explain at length why the command-based design of distutils is a nightmare to work with, for example, I don't feel very confident that the current maintainers are aware of the issues. It suggests that they never had to extend distutils much.
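To make that complaint concrete: even a trivial customization means subclassing a command class, overriding run(), and wiring the subclass in by hand. A minimal sketch, with a made-up package name:

    from distutils.core import setup
    from distutils.command.build import build

    class my_build(build):
        # Trivial extension: log before delegating to the stock build.
        # Anything real has to poke at self.distribution and global options.
        def run(self):
            print("about to build")
            build.run(self)

    setup(
        name="example",                 # hypothetical package
        version="0.1",
        cmdclass={"build": my_build},   # all customization flows through
    )                                   # this command-class registry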
All agreed! I'd add to the list parallel builds/tests (make -j 16) and output to native build systems, e.g. Xcode and MSVC projects, and makefiles.
Essentially out of frustration with distutils and setuptools, I have migrated to CMake for pretty much all my build systems (except for a few SCons-based ones I haven't had to touch in a while), since it supports all the features mentioned above. Even dealing with CMake's god-awful "scripting language" is better than dealing with distutils. I am very happy to see David C.'s efforts to finally get away from distutils, but I am worried that a cross-platform build system with all the features he wants is simply beyond the scope of 1-2 people unless they work on it full time for a year or two.
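The thin driver this migration leaves me with is roughly the following sketch (paths and targets made up; all the real logic lives in the CMakeLists.txt files):

    import os
    import subprocess

    build_dir = "build"
    if not os.path.isdir(build_dir):
        os.makedirs(build_dir)

    subprocess.check_call(["cmake", ".."], cwd=build_dir)       # configure
    subprocess.check_call(["make", "-j", "16"], cwd=build_dir)  # parallel build
    subprocess.check_call(["ctest", "--output-on-failure"],
                          cwd=build_dir)                        # run tests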
Yeah, currently about a quarter of science packages use C/C++/Fortran/Cython etc. (see http://pypi.python.org/pypi?:action=browse&c=40 - 110 of the 458 packages on that page). There seem to be a lot more packages using C/C++ there than in other package categories (e.g. the zope3 list has 0 out of 900 packages using C/C++).
So the high number of C/C++ science-related packages on PyPI demonstrates that better C/C++ tooling for scientific packages is a big need - especially compile/test farms for all these packages. Compile farms are a much bigger need there than for pure Python packages, since C/C++ is MUCH harder to write and test in a portable way. I would say it is close to impossible to get such code working on multiple platforms without errors unless you have quite good knowledge of each platform.
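Those counts can be pulled programmatically, too; a sketch against PyPI's XML-RPC interface (assuming its browse method is still exposed; it returns (name, version) pairs, so this counts releases rather than unique packages):

    import xmlrpclib  # xmlrpc.client on Python 3

    client = xmlrpclib.ServerProxy("http://pypi.python.org/pypi")
    science = client.browse(["Topic :: Scientific/Engineering"])
    print("science entries: %d" % len(science))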
I am not sure that is quite true. C++ is not a very popular language around here, but the combination of boost+Qt+python+scipy+hdf5+h5py has made virtually all of my platform-specific code vanish (with the exception of some platform-specific stuff in my CMake scripts).
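For instance, moving results between machines through hdf5/h5py involves no platform-specific code at all; a minimal sketch, with made-up file and dataset names:

    import numpy as np
    import h5py

    data = np.random.randn(1000)

    f = h5py.File("results.h5", "w")      # same on-disk format on every platform
    f.create_dataset("noise", data=data)
    f.close()

    f = h5py.File("results.h5", "r")
    print(f["noise"].shape)               # (1000,)
    f.close()

Regards,
Ravi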