[Python-Dev] Fwd: PEP: Migrating the Python CVS to Subversion

Sun Aug 14 20:13:29 CEST 2005

Here's another POV. (Why does evereybody keep emailing me personally?)

--Guido van Rossum (home page: http://www.python.org/~guido/)

---------- Forwarded message ----------
From: Daniel Berlin <dberlin at dberlin.org>
Date: Aug 13, 2005 7:33 PM
Subject: Re: [Python-Dev] PEP: Migrating the Python CVS to Subversion
To: gvanrossum at gmail.com

(Sorry for the lack of proper References: headers, this is a reply to
the email archive).

It's been a couple years since i've been in the world of python-dev, but
apparently i'm rejoining the mailing list at just the right time.

Take all of this for what it is worth:

I'm currently responsible for GCC's bugzilla, wiki, in addition to
maintaining several optimization areas of the compiler :P.
In addition, i'm responsible for pushing GCC (my main love and world :P)
towards Subversion.  I should note my bias at this point.  I now have
full commit access to Subversion.  However, I've also submitted patches
to monotone, etc.

We had a long thread about the various alternatives (arch, bzr, etc),
and besides "freeness" constraints on what we can run on gcc.gnu.org as
an FSF project, it wouldn't have mattered anyway.  This has been in the
planning for about a year now (mainly waiting for new hardware).

Originally, we were hoping to move GCC to monotone, but it didn't mature
fast enough (it's way too slow), and we couldn't make it centralized
enough for our tastes (more later).

The rest of the free tools other than subversion (arch, monotone, git,
darcs, etc) simply couldn't handle our repository with reasonable
speed/hardware.  GCC has project history dating back to 1987.  It's a 4
gig CVS repo containing > 1000 tags, and > 300 branches.  The changelog
alone has 30k trunk revisions.   Those distributed systems that carry
full history often can't deal with this fast enough or in any space
efficient way.  arch was an example of this.  It had a retarded
mechanism that forced users to care about caching certain revisions to
speed it up , instead of doing it on it's own.  I've never tried
converting this repo to bazaar-ng, it wasn't far enough along when i
started.  It also had no cvs2bzr type program, and we aren't about to
lose all our history.  Except for monotone (builtin cvs_import) and
subversion (cvs2svn), none of the cvs2* programs i've run across either
run in reasonable time (those that don't actually understand how to
extract rcs revisions would take weeks to convert our repo, literally),
or could handle all the complexities a repository with our history
present (branches of branches, etc).  Most simply crash in weird ways or
run out of memory :).

Anyway:
Monotone took 45 minutes just to diff two old revisions that are one
revision away from each other.
CVS takes about 2 minutes for the same operation.
SVN on fsfs takes 4 seconds.

The converted SVN repo has > 100000 revisions, and is only ~15% bigger
than the cvs repo (mostly due to stupid copies it has to do to handle
some tag fuckery people were doing in some cases.  If it had been
subversion from the start, it would have been smaller).

We have cvs2svn speedup patches that were done with the KDE folks that
make cvs2svn io bound again instead of cpu bound (it was O(N^2) in
extracting cvs revision texts before).  It takes 17 hours to convert the
gcc repository now, only 45 minutes of cpu time :).  It used to take 52
hours.

I've also talked with Linus about version control before.  He believes
extreme distributed development is the way to go.  I believe heavily
that in most cases where you have a mix of corporations and free
developers, it ends up causing people to "hide the ball" more than they
should.  This is particularly prevalent in GCC.  We don't want design
and development done and then sent as mega-patches presented as fait
accompli, then watch these people whine as their designs get torn apart.
We'd rather have the discussion on the mailing list and the work done in
a visible place (IE CVS branch stored in some central place) rather than
getting patch bombs.  As a result (and there are many other reasons, i'm
just presenting one of them), we actually don't *want* to move from a
centralized model, in order to help control the social and political
problems we'd have to face if we went fully distributed.

Python may not face any of these problem to the degree that GCC does (i
doubt many projects do actually. GCC is a very weird and tense political
situation :P), because of size, etc, in which case, a distributed model
may make more sense.  However, you need to be careful to make sure that
people understand that it hasn't actually changed your real development
process (PEP's, etc), only the workflow used to implement it.