[Python-3000] How to override io.BytesIO and io.StringIO with their optimized C version?

Fri Dec 28 17:40:36 CET 2007

I just wrote up a summary of Mercurial and distributed version control
systems for a coworker.  I figured I'd add it to the discussion in case it
helps get some of the main ideas of these systems across...

-Damon

-----

Mercurial (hg) is one of a new breed of "distributed" version control
systems, which I think have a lot of advantages over traditional systems
that have a single copy of the repository.  I am using it for my own
personal software these days and highly recommend it.  If you've heard of or
used svk with svn, then you may be familiar with a few of the ideas
involved.

*Basic Ideas of Distributed Version Control*

Here are some points that should give a flavor for what distributed
version control and hg are all about:

   - It turns out that with the right representation of a repository on
   disk, you can usually store the entire history of a project in not much more
   space than it takes to store the current snapshot of the project.  And in
   some cases, you can store the entire history in less space.  (Also note that
   the size of compiled libraries and executables often dwarfs the size of the
   source + history.)
   - Given this, it makes sense for each developer to have a copy of the
   entire repository on their local hard drive.  Then they can do a series of
   commits on their laptop hard drive without waiting for any kind of network
   activity (or even while they're on a plane) and then push the changes onto
   the "master" repository at a later date without losing all the incremental
   local revisions.  Besides mobile/disconnected development, the major benefit
   of a local repository is that it makes most operations effectively
   instantaneous.  Even some of the slowest operations, such as large
   (entire-project) diffs, become quick.
   - You take the hit for downloading the repository once, up front.
   From then on, compressed diffs (just like those used in the repository
   itself) are sent over the network.  The diffs for a revision also come with
   a cryptographic hash so that when they are expanded you can be sure
   they're not corrupted (more on that later).
   - There is a separation/independence between the most recent revision
   of the project that your repository knows about and the state of the source
   files on your disk.  You can make your source reflect a previous state of
   the project and then jump forward to the most current state without
   connecting to the master repository.  This separation also means that
   downloading a set of revisions from the "master" repository (for instance,
   all the changes that have occurred since your last download) is a separate
   step from actually updating your source files to reflect those changes.
   This way, if you're in the middle of some local changes, you can download
   the latest revisions without merging them into your source, disconnect from
   the network, keep working on your local changes, and then, only when you're
   happy with your changes, you can merge them with those that you downloaded.

   - Since repositories are small and copying local files is fast,
   there's low overhead for having multiple copies of a repository on your
   drive.  So a branch is implemented as a copy of a repository, along with a
   pointer to the original (this is very fast because hard links are used for
   the files that make up the repository itself, with copy-on-write semantics
   for later changes).  You can do independent changes in an independent
   repository and then merge the results when you're done.  Once again, you can
   do multiple experimental commits on each repository without worries of
   polluting the master repository with a record of those changes until you're
   sure they're worth pushing back.
   - Changes can be either pulled from another repository or pushed to
   it.  Again, the separation between the set of revisions a repository knows
   about and the state of the source files on the disk means that the owner of
   the repository gets to review the changes before making them official.
   - Since multiple branches are the norm, merging is a common
   operation.  As much merge support as possible is included in the tool and
   the rest is easy to plug in via standard interactive merge tools.  The
   result is that merges tend to be quite easy and when your lines of code
   don't overlap you typically don't even have to think about them.
   - Given that each developer has a copy of the entire repository,
   there's no reason that two developers have to communicate changes through
   the centralized repository.  If I have changes you want, you can pull them
   from my repository, preserving all the incremental revisions and comments
   that I've made, and then push the entire thing back to the master when
   you're done.  The history of branches and merges, including who did what, is
   kept so that the history reflects what actually happened (including the fact
   that there were 2 parallel branches for a while).
   - By cryptographically hashing each "change set" (not only the state
   of the files in that revision but also the history of the revision), the
   revision control system can give you a single unique 40-digit hexadecimal
   "change set number" that can be verified against an actual repository to
   guarantee that it is not corrupted in any way.  (For day-to-day use, a
   developer only needs to use the first few digits of the number, enough to
   be unambiguous within that repository, or can instead use simple,
   consecutive revision numbers.)
   - The final idea is that in this model, there's nothing special about
   the master repository.  Every repository has the same capabilities as
   every other, and it's up to a project to assign any special significance to
   one or more repositories.
   - Every file required by the version control system can be stored in a
   single hidden (.hg) directory at the top level of the repository.  This is
   the single point of file namespace pollution for using the system in a
   project.
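
To make the first point above more concrete, here is a minimal Python
sketch of storing a file's history as one full snapshot plus a chain of
deltas.  It is only an illustration of the idea: the names (make_delta,
apply_delta, History) are made up for this example, and Mercurial's real
on-disk format is more elaborate than this.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Describe `new` as copies from `old` plus literal new lines."""
    delta = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))       # reuse old[i1:i2]
        elif tag in ("replace", "insert"):
            delta.append(("data", new[j1:j2]))   # store only the new lines
        # "delete" needs no entry: those old lines are simply not copied
    return delta


def apply_delta(old, delta):
    """Rebuild the newer revision from the older one and a delta."""
    new = []
    for op in delta:
        if op[0] == "copy":
            new.extend(old[op[1]:op[2]])
        else:
            new.extend(op[1])
    return new


class History:
    """Hypothetical store: first revision whole, later ones as deltas."""

    def __init__(self, first):
        self.base = list(first)
        self.deltas = []
        self._tip = list(first)

    def commit(self, lines):
        self.deltas.append(make_delta(self._tip, lines))
        self._tip = list(lines)

    def checkout(self, rev):
        lines = self.base
        for delta in self.deltas[:rev]:
            lines = apply_delta(lines, delta)
        return list(lines)
```

Each commit only stores the lines that changed, which is why the whole
history can end up close to the size of a single snapshot when revisions
are small edits of one another.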

That's probably enough to give the basic idea.
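
The "change set number" idea can also be sketched in a few lines of
Python.  This is a simplified illustration, not Mercurial's exact
format: the function names and the NULL_ID placeholder are assumptions
made up for this example.  The one thing it does share with the real
tool is the use of SHA-1, whose hex digest is 40 digits long.

```python
import hashlib

# Hypothetical placeholder used as the "parent" of the very first change set.
NULL_ID = "0" * 40


def changeset_id(parent_id, content):
    """Hash the parent's ID together with this revision's content."""
    h = hashlib.sha1()
    h.update(parent_id.encode("ascii"))
    h.update(content)
    return h.hexdigest()  # 40 hexadecimal digits


def build_history(revisions):
    """Return the chain of IDs for a linear list of revision contents."""
    ids = []
    parent = NULL_ID
    for content in revisions:
        parent = changeset_id(parent, content)
        ids.append(parent)
    return ids


def resolve_prefix(prefix, known_ids):
    """Expand a short prefix to a full ID if it is unambiguous."""
    matches = [i for i in known_ids if i.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError("prefix %r matches %d IDs" % (prefix, len(matches)))
    return matches[0]
```

Because each ID includes its parent's ID, changing any earlier revision
changes every ID after it, so a single ID vouches for the whole history
leading up to it.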

*User Interface*

As far as the interface goes, you type "hg init" at the top level of a
project directory containing your source files to create a boilerplate
"blank-slate" .hg directory.  You type "hg status" to see which new files
Mercurial sees in your directory (all files show up as new files).  You
create a .hgignore file at the top level and add glob- or regex-style
patterns to tell Mercurial about the files that it should not track (object
files, backup files, etc.).  You type "hg status" to confirm that the set of
files Mercurial sees are the ones you want to add to the project and then
"hg add" to add them all (you can also add a subset with "hg add
<filenames>").  Finally, you do "hg commit" ("hg ci" is equivalent).  At
this point you have your first revision and a "hg status" should show
nothing (or whatever files you chose not to add if you didn't add them all
for the first commit).  Typing "hg clone <hg_dir>" in a completely separate
directory will create a copy (branch) of your project there.  "hg clone
<url>" or "hg clone ssh://user@host:22/<hg_dir>" will create a branch of a
network-visible project.  "hg pull [<repository>]" will copy/download any
new revisions from the named repository, which defaults to the one you
branched from.  "hg update" is required to bring the state of your sandbox
in sync with the downloaded revisions.  An "hg merge" may be needed if there
are conflicts.  "hg push" will push changes back up to the place your
directory was cloned from (again, an "hg update" and "hg commit" are
required on the other end to make those changes official).
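
For anyone who wants to script the sequence above, here is a rough
Python sketch that just wraps the same hg commands.  The helper names
(FIRST_COMMIT_STEPS, sync_steps, run_steps) and the commit message are
made up for this example; the hg commands themselves are the ones
described above, with "-m" supplying the commit message so no editor
pops up.  It assumes hg is installed and, for the commit, that a
username is configured.

```python
import shutil
import subprocess

# The first-commit sequence from the text, one argv list per step.
FIRST_COMMIT_STEPS = [
    ["hg", "init"],
    ["hg", "status"],
    ["hg", "add"],
    ["hg", "commit", "-m", "Initial revision"],
]


def sync_steps(source=None):
    """Commands to fetch new revisions and then update the sandbox.

    `source` is an optional repository path or URL; without it,
    "hg pull" uses the repository this one was cloned from.
    """
    pull = ["hg", "pull"] + ([source] if source else [])
    return [pull, ["hg", "update"]]


def run_steps(steps, cwd):
    """Run each command in directory `cwd`, stopping at the first failure."""
    if shutil.which("hg") is None:
        raise RuntimeError("hg not found on PATH")
    for argv in steps:
        subprocess.run(argv, cwd=cwd, check=True)
```

For example, run_steps(FIRST_COMMIT_STEPS, "/path/to/project") would
create the repository and make the first commit, and
run_steps(sync_steps(), "/path/to/clone") would pull and update a clone.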

Mercurial also has a built-in web server that can be started if you want
people on a shared network to be able to browse your repository that way.
In addition, it comes with a graphical tool (hgk) which allows you to see
the history of a project including branches and merges.  I'm fond of using
tkdiff to diff my sandbox with a repository so I hacked a copy of tkdiff to
do this (perhaps by now the official one supports it as well).

One more cool feature I have to mention: Mercurial has a "bisect" command
that you can use for finding when a bug was introduced.  You start bisect,
which chooses a revision of the code for you.  You run your regression test
and run bisect again, telling it whether your test succeeded or failed.
This chooses a new revision of the code using a binary search.  In a small
number of iterations, you find the change that broke the code.  Clearly,
with an automated test this is easy to automate.  I haven't used the command
yet, but am looking forward to it.  I think we should create something
similar for use with xcs, since this automates a useful process that many
people find prohibitively tedious.
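
The search that bisect performs can be sketched in Python as follows.
This is my own simplified version, not Mercurial's code: it assumes the
revisions are consecutive integers on a single line of history, that
the "good" revision passes the test, the "bad" one fails, and that every
revision after the breaking change also fails.  The function name and
arguments are made up for this example.

```python
def first_bad_revision(good, bad, is_good):
    """Binary-search for the earliest failing revision in (good, bad].

    `is_good(rev)` stands in for running the regression test on `rev`.
    Returns the first bad revision and the number of tests that were run.
    """
    tested = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tested += 1
        if is_good(mid):
            good = mid   # the bug was introduced after mid
        else:
            bad = mid    # the bug was introduced at or before mid
    return bad, tested
```

With 100 revisions between the known-good and known-bad ends, this
needs at most 7 test runs, since each run halves the remaining range.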

*Mercurial Weaknesses*

One weakness of Mercurial is that it does not have support for storing
multiple projects (i.e. you might want to selectively check out a single
project without checking out the rest) in the same repository.  If your
project source includes large, independent subsystems and projects (a
situation that I haven't dealt with yet in my own use of hg), it sounds like
the way to handle this is to use relative symbolic links in the separate
projects.  But I've only read some references to doing this and I don't
completely understand it yet.

*Other Distributed Version Control Tools*

There's another system, called Git, that is probably just as good as
Mercurial.  Git was created by Linus Torvalds and is based on almost the
same set of ideas as Mercurial.  Git is quite a bit faster for many
operations and uses a little less disk space, but last I checked it still
had bad Windows support and Mercurial had better documentation.  Otherwise,
they seem pretty similar in terms of robustness and features, although I
found git to be a bit more confusing on first glance because of the large
number of additional "plumbing" commands that it makes available. Recently I
took a second look and got the impression that there's a subset of the git
commands that are almost identical to the hg commands.  Although I
originally tried git first, I later tried Mercurial and (perhaps because of
the better documentation) never went back.  I was also impressed by the
fact that the original Mercurial source code was just a couple thousand
lines of pure Python and was, at the time, within a factor of 2 of the speed
of Git.  I appreciate the engineering required to write something in an
elegant, concise way and still have it perform within an order of magnitude
of C code written by a master of ultra-efficient OS-level C.  The Mercurial
source is larger now, and includes 3 small (<500 lines each) C files to
speed up diffs, patch files, and some other low-level features, but is still
pretty small (~20KLOC vs. ~90KLOC for Git).  In any case, both tools were so
many times faster than any version control I'd used in the past that I
really didn't care about that last percent of speed that Git might give me.
The tools have shared a lot of ideas and even some code (e.g. hgk is derived
from a Git GUI called gitk, Git has copied some of hg's features) and I
expect this to continue since they're both developed by active members of
the Linux kernel community.

The rest of the distributed version control systems I know of are: bzr
("Bazaar", written in Python), monotone, GNU Arch, and darcs (written in
Haskell).  Darcs is supposed to be conceptually different from the others
but I really don't know much about it.  All of these others had
significantly lower performance than git or hg last I checked, but things
are changing fast.  Bzr is supposed to have a slightly easier command line
interface, but I find Mercurial's to be pretty easy already.  There are
tools for migrating projects from each of these to any of the others.

SVK is a tool for use with SVN that lets you have a local repository.  I
think it's kind of like an svn repository on one side (the user side) and
an svn client on the other (the side that talks to the real SVN server).  A
good friend uses it and recommends it for SVN users who want to be able to
do "offline" publishes.  My understanding, however, is that it doesn't
provide any of the other features of the tools above.

*Migrating to Mercurial*

There are a few tools for migrating projects from SVN and CVS.  The one
called "hgsvn" seemed like the best for SVN last time I checked.  The
original import is kind of slow, though, since it has to do something like
check out each revision from the SVN server.

Some people seem to feel productive using hg locally while publishing to a
cvs or svn server.  I'm not sure how that works though.

*Links*

Official Page - http://www.selenic.com/mercurial/wiki/
Tutorial - http://www.selenic.com/mercurial/wiki/index.cgi/Tutorial
Book - http://hgbook.red-bean.com/hgbook.html

Google Tech Talk about Hg: http://www.youtube.com/watch?v=JExtkqzEoHY
Linus' egotistical talk about Git:
http://www.youtube.com/watch?v=4XpnKHJAok8
Randal Schwartz' talk about Git: http://www.youtube.com/watch?v=8dhZ9BXQgc4

Performance Benchmarks (of varying quality and age) -
  http://weblogs.mozillazine.org/jst/archives/2006/11/vcs_performance.html
  http://weblogs.mozillazine.org/jst/archives/2007/02/bzr_and_different_network_prot.html
  http://git.or.cz/gitwiki/GitBenchmarks
  https://lists.ubuntu.com/archives/bazaar/2006q2/011953.html

-Damon