On Wednesday 30 December 2009 06:15:45 René Dudfield wrote:
I agree with many things in that post. Except your conclusion on multiple versions of packages in isolation. Package isolation is like processes, and package sharing is like threads - and threads are evil!
You have stated this several times, but is there any evidence that this is the desire of the majority of users? In the scientific community, interactive experimentation is critical and users are typically not seasoned systems administrators. For such users, almost all packages installed after installing python itself are packages they use. In particular, all I want to do is to use apt/yum to get the packages (or ask my sysadmin, who rightfully has no interest in learning the intricacies of python package installation, to do so) and continue with my work. "Packages-in-isolation" is for people whose job is to run server farms, not interactive experimenters.
Leave my python site-packages directory alone I say... especially don't let setuptools infect it :) Many people currently find that the multiple-versions-in-isolation approach works well for them - so for some use cases the tools are working wonderfully.
More power to them. But for the rest of us, that approach is too much hassle.
Science is supposed to allow repeatability. Without the same versions of packages, repeating experiments is harder.
Really? IME, this is not the case. Simulations in signal processing are typically run with two different kinds of data sets:
- random data for Monte Carlo simulations
- well-known and widely available test streams
With both kinds of data sets, reimplementation of the same algorithms is rarely, if ever, affected by the versions of packages, primarily because of the wide variety of tool sets (and even more versions) that are in use.
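For what it's worth, here is a minimal sketch (plain numpy, with made-up parameters) of the kind of Monte Carlo run I mean: estimating the bit error rate of BPSK over an AWGN channel. Any reimplementation, in any language or tool set, should reproduce these numbers to within Monte Carlo noise, regardless of which package versions happen to be installed:

    import numpy as np

    def bpsk_ber(snr_db, n_bits=100000, seed=42):
        rng = np.random.RandomState(seed)      # fixed seed for repeatability
        bits = rng.randint(0, 2, n_bits)
        symbols = 2 * bits - 1                 # map {0,1} -> {-1,+1}
        noise_std = np.sqrt(0.5 / 10 ** (snr_db / 10.0))
        received = symbols + noise_std * rng.randn(n_bits)
        decided = (received > 0).astype(int)   # hard decision
        return np.mean(decided != bits)

    for snr in (0, 2, 4, 6, 8):
        print("SNR %d dB: BER %.4f" % (snr, bpsk_ber(snr)))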
Repeatability is a big problem in science, and multiple versions of packages in _isolation_ can help get us to a solution to it.
Package versions are, at worst, a very minor distraction in solving the repeatability problem. Usually, the main issues are unclear descriptions of the algorithms and unstated assumptions.
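And when the versions do matter, documenting them takes a few lines of code anyway; a sketch using pkg_resources (which ships with setuptools):

    import pkg_resources

    # Dump every installed distribution as an exact-version requirement.
    for dist in sorted(pkg_resources.working_set,
                       key=lambda d: d.project_name.lower()):
        print("%s==%s" % (dist.project_name, dist.version))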
Just pick some random paper and try to reproduce their results. It's generally very hard, unless the software is quite well packaged.
In scientific experimentation, it is folly to rely on software from the author of some random paper. In signal processing, almost every critical algorithm is re-implemented, and usually in a different language. The only exceptions are when the software can be validated with a large amount of test data, but this is very rare. Usually, you use some package to get started in your current environment. If it works (i.e., results meet your quality metric), you then build on it. If it does not work (even if only due to version incompatibility), you usually jettison it and either find an alternative or rewrite it.
Multiple versions are not a replacement for backwards compatibility, just a way to sidestep the problem in the short term so you are not blocked. If a new package version breaks your app, then you can either pin it to an old version, fix your app, or fix the package. They are also not a replacement for building on stable, high quality components, but they help you work with less stable, lower quality components - at a much faster rate of change, and with a much larger dependency list.
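For concreteness: "pin it to an old version" typically just means declaring an exact-version requirement. A minimal setuptools sketch, with made-up package names:

    from setuptools import setup

    setup(
        name="myapp",                           # hypothetical application
        version="0.1",
        install_requires=["SomePackage==1.2"],  # pin to the known-good release
    )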
This is a software engineer + systems administrator solution. In larger institutions, this is absolutely unworkable if you rely on IT for package management/installation.
Plenty of good work is going on with python packaging.
That's the opposite of my experience. What I care about is:
- tools which are hackable and easily extensible
- robust install/uninstall
- a real, DAG-based build system
- explicitness and repeatability
None of this is supported by the tools, and the current directions move even further away from it. When I have to explain at length why the command-based design of distutils is a nightmare to work with, for example, I don't feel very confident that the current maintainers are aware of the issues. It suggests that they never had to extend distutils much.
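To make that complaint concrete: even a trivial customization means subclassing a command class, overriding run(), and wiring the subclass in by hand. A minimal sketch, with a made-up package name:

    from distutils.core import setup
    from distutils.command.build import build

    class my_build(build):
        # Trivial extension: log before delegating to the stock build.
        # Anything real has to poke at self.distribution and global options.
        def run(self):
            print("about to build")
            build.run(self)

    setup(
        name="example",                 # hypothetical package
        version="0.1",
        cmdclass={"build": my_build},   # all customization flows through
    )                                   # this command-class registry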
All agreed! I'd add to the list parallel builds/tests (make -j 16) and output to native build systems, e.g. Xcode and MSVC projects, and makefiles.
Essentially out of frustration with distutils and setuptools, I have migrated to CMake for pretty much all my build systems (except for a few SCons-based ones I haven't had to touch in a while), since it supports all the features mentioned above. Even dealing with CMake's god-awful "scripting language" is better than dealing with distutils. I am very happy to see David C.'s efforts to finally get away from distutils, but I am worried that a cross-platform build system with all the features he wants is simply beyond the scope of 1-2 people unless they work on it full time for a year or two.
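The thin driver this migration leaves me with is roughly the following sketch (paths and targets made up; all the real logic lives in the CMakeLists.txt files):

    import os
    import subprocess

    build_dir = "build"
    if not os.path.isdir(build_dir):
        os.makedirs(build_dir)

    subprocess.check_call(["cmake", ".."], cwd=build_dir)       # configure
    subprocess.check_call(["make", "-j", "16"], cwd=build_dir)  # parallel build
    subprocess.check_call(["ctest", "--output-on-failure"],
                          cwd=build_dir)                        # run tests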
Yeah, currently about a quarter of science packages use C/C++/Fortran/Cython etc. (see http://pypi.python.org/pypi?:action=browse&c=40 - 110 of the 458 packages on that page). There seem to be a lot more packages using C/C++ there than in other package categories (e.g. the zope3 list has 0 out of 900 packages using C/C++).
So the high number of C/C++ science-related packages on PyPI demonstrates that better C/C++ tooling for scientific packages is a big need - especially compile/test farms for all these packages. Compile farms are a much bigger need there than for pure Python packages, since C/C++ is MUCH harder to write and test in a portable way. I would say it is close to impossible to get such code working on multiple platforms without errors unless you have quite good knowledge of each platform.
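Those counts can be pulled programmatically, too; a sketch against PyPI's XML-RPC interface (assuming its browse method is still exposed; it returns (name, version) pairs, so this counts releases rather than unique packages):

    import xmlrpclib  # xmlrpc.client on Python 3

    client = xmlrpclib.ServerProxy("http://pypi.python.org/pypi")
    science = client.browse(["Topic :: Scientific/Engineering"])
    print("science entries: %d" % len(science))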
I am not sure that is quite true. C++ is not a very popular language around here, but the combination of boost+Qt+python+scipy+hdf5+h5py has made virtually all of my platform-specific code vanish (with the exception of some platform-specific stuff in my CMake scripts).
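For instance, moving results between machines through hdf5/h5py involves no platform-specific code at all; a minimal sketch, with made-up file and dataset names:

    import numpy as np
    import h5py

    data = np.random.randn(1000)

    f = h5py.File("results.h5", "w")      # same on-disk format on every platform
    f.create_dataset("noise", data=data)
    f.close()

    f = h5py.File("results.h5", "r")
    print(f["noise"].shape)               # (1000,)
    f.close()

Regards,
Ravi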