[Numpy-discussion] DVCS at PyCon

Fernando Perez fperez.net at gmail.com
Sat Apr 11 23:28:53 EDT 2009


2009/4/11 Stéfan van der Walt <stefan at sun.ac.za>:

> From my POV, the current system is very unproductive and, while
> git-svn makes life a bit easier, it comes with its own set of
> headaches.  Especially now that we are evaluating different
> work-flows, we need the right kind of vcs to back it up.

Please take the following with a big grain of salt, because I haven't
used either git or hg beyond trivial cloning.  But by now I do have
pretty extensive experience with bzr (ipython and nipy), so I think
I've developed a decent intuition about what's important/useful in the
long run from DVCS workflows.  I have also been reading a fair amount
about git and hg, in particular about the core of their internal
models.  This, it seems to me, is actually far more important in the
long run than their top-layer polish.

From that perspective, right now my intuition (and yes, it's only
that for now) tells me that git has some really appealing features
for a DVCS which, as far as I can see, are not present in hg (unless
I've grossly misunderstood it, hg is much closer to bzr in its
internal model than to git).  To me, the point of fundamental value
in git is its direct manipulation of the commit DAG and
history: this is something that I think one only comes to appreciate
after using a DVCS for *a reasonably long time* on a *reasonably
complex project* with multiple developers, branches and merges.  I
stress this because I think these points really only become apparent
under such conditions; at least, I didn't really think about them
until I had used bzr extensively for ipython.

Let me elaborate a bit.  One of the main benefits of a DVCS is that it
enables all developers to be aggressive locally, to experiment with
crazy ideas and to use the VCS as a safety line in that
experimentation.  You are free to try wild things, to commit as often
and as finely grained as you want, and if things go wrong, you can
backtrack easily.  But in general things don't simply go wrong: you
often end up making things work, it's just that the intermediate
history can look totally crazy, with tons of intermediate commits
that are no longer of interest to anyone.  With git, there is a way
of saying "collapse all of this into a single commit (or a few)", so
that your work lands in the upstream project in chunks that make
logical sense, rather than ones that merely record your tiptoeing
through a tricky part of the development.
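
To make this concrete, the cleanup I have in mind looks roughly like
the following sketch (the branch name, feature and commit count are
invented for illustration; both commands are standard git):

  # Interactively squash the last five messy commits into a few
  # coherent ones before publishing them:
  git rebase -i HEAD~5

  # Or fold an entire experimental branch into a single commit on
  # the main line:
  git checkout master
  git merge --squash crazy-idea
  git commit -m "Add feature X (collapsed from crazy-idea)"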

In bzr (and, as far as I can see, also in hg) this kind of history
rewriting is nearly impossible, so the best you can do is make a
merge commit and leave all that history in there, visible in the
'second level' log (indented in the text view).  As a project gains
many developers, having all of this history merged back into the main
project tree gets unwieldy.

From my (now reasonably extensive) experience with bzr, it really
feels like a system that took the centralized VCS model and put 'a
little svn server everywhere you need one'.  That is, the
repository/branch/history model retains the rigidity of a centralized
VCS, it's just that you can have it anywhere, and it can track
branching and merging intelligently.  There's certainly a lot of
value in that; I am not denying it in the least.

However, git seems to really make the key conceptual jump of saying:
once you have a truly distributed development process, that rigid
model just breaks down and should be abandoned.  What you need to
accept is that the core objects you should manipulate are the atomic
change units needed to reconstruct the state of the project, and the
connectivity between those units.  If you have tools to manipulate
said entities, you'll be able to really integrate the work that many
people may be doing on the same objects in disconnected ways, back
into a single coherent entity.
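
To see what those atomic units actually look like, here is roughly
what git reports for a single commit (a sketch: the hashes, name and
message are invented, and the trailing comments are my annotations):

  $ git cat-file -p HEAD
  tree 9bdf3c5d2f...        # snapshot of the whole project state
  parent 2c8e7a1b4e...      # pointer to the preceding commit(s)
  author Jane Dev <jane@example.com> 1239500000 -0700
  committer Jane Dev <jane@example.com> 1239500000 -0700

  Fix corner case in the solver

A commit is just a snapshot plus pointers to its parents, so
rewriting history amounts to building new nodes in that graph, which
is why git can do it so cheaply.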

Sorry if this seems a bit abstract, but I've been thinking about
this for the past couple of days, and I figured I'd share.  I don't
mean this to be a bashing of hg or bzr (which we'll continue using for
ipython at least for a long while, since now is not the time for yet
another workflow change for us).  But from *my* perspective, git
really offers the right abstractions for thinking about distributed
collaborative workflows, while the other systems seem merely to offer
tools for distributing the workflow of a rigid development history (a
la CVS) across multiple developers.  There's a fundamental difference
between those two approaches, and I think it's a critically important
one.

As for what numpy/scipy should do, I'll leave that to those who
actually contribute to the projects :)  I just hope that this view is
useful to anyone who wants to think about the problem from an angle
different to that of specific commands, benchmarks or features :)

All the best,

f


