[Python-3000] How to override io.BytesIO and io.StringIO with their optimized C version?
Damon McCormick
damonmc at gmail.com
Fri Dec 28 17:40:36 CET 2007
I just wrote up a summary of Mercurial and distributed version control
systems for a coworker. I figured I'd add it to the discussion in case it
helps get some of the main ideas of these systems across...
-Damon
-----
Mercurial (hg) is one of a new breed of "distributed" version control
systems, which I think have a lot of advantages over traditional systems
that have a single copy of the repository. I am using it for my own
personal software these days and highly recommend it. If you've heard of or
used svk with svn, then you may be familiar with a few of the ideas
involved.
*Basic Ideas of Distributed Version Control*
Here are some points that should give a flavor for what distributed
version control and hg are all about:
- It turns out that with the right representation of a repository on
disk, you can usually store the entire history of a project in not much more
space than it takes to store the current snapshot of the project. And in
some cases, you can store the entire history in less space. (Also note that
the size of compiled libraries and executables often dwarfs the size of the
source + history.)
- Given this, it makes sense for each developer to have a copy of the
entire repository on their local hard drive. Then they can do a series of
commits on their laptop hard drive without waiting for any kind of network
activity (or even while they're on a plane) and then push the changes onto
the "master" repository at a later date without losing all the incremental
local revisions. Besides mobile/disconnected development, the major benefit
of a local repository is that it makes most operations effectively
instantaneous. Even some of the slowest operations, such as large
(entire-project) diffs become quick.
- You take the hit for downloading the repository once, up front.
From then on, compressed diffs (just like those used in the repository
itself) are sent over the network. The diffs for a revision also come with
a cryptographic hash so that when they are expanded you can be sure
they're not corrupted (more on that later).
- There is a separation/independence between the most recent revision
of the project that your repository knows about and the state of the source
files on your disk. You can make your source reflect a previous state of
the project and then jump forward to the most current state without
connecting to the master repository. This separation also means that
downloading a set of revisions from the "master" repository (for instance,
all the changes that have occurred since your last download) is a separate
step from actually updating your source files to reflect those changes.
This way, if you're in the middle of some local changes, you can download
the latest revisions without merging them into your source, disconnect from
the network, keep working on your local changes, and then, only when you're
happy with your changes, you can merge them with those that you downloaded.
- Since repositories are small and copying local files is fast,
there's low overhead for having multiple copies of a repository on your
drive. So a branch is implemented as a copy of a repository, along with a
pointer to the original (this is very fast because hard links are used for
the files that make up the repository itself, with copy-on-write semantics
for later changes). You can do independent changes in an independent
repository and then merge the results when you're done. Once again, you can
do multiple experimental commits on each repository without worries of
polluting the master repository with a record of those changes until you're
sure they're worth pushing back.
- Changes can be either pulled from another repository or pushed to
it. Again, the separation between the set of revisions a repository knows
about and the state of the source files on the disk means that the owner of
the repository gets to review the changes before making them official.
- Since multiple branches are the norm, merging is a common
operation. As much merge support as possible is included in the tool and
the rest is easy to plug in via standard interactive merge tools. The
result is that merges tend to be quite easy and when your lines of code
don't overlap you typically don't even have to think about them.
- Given that each developer has a copy of the entire repository,
there's no reason that two developers have to communicate changes through
the centralized repository. If I have changes you want, you can pull them
from my repository, preserving all the incremental revisions and comments
that I've made, and then push the entire thing back to the master when
you're done. The history of branches and merges, including who did what, is
kept so that the history reflects what actually happened (including the fact
that there were 2 parallel branches for a while).
- By cryptographically hashing each "change set" (not only the state
of the files in that revision but also the history of the revision), the
revision control system can give you a single unique 40-hex-digit "change
set number" that can be verified against an actual repository to guarantee
that it is not corrupted in any way. (For day-to-day use, a developer only
needs to use the first few digits of the number, enough to be unambiguous
within that repository, or can instead use simple, consecutive revision
numbers.)
- The final idea is that in this model, there's nothing special about
the master repository. Every repository has the same capabilities as
every other, and it's up to a project to assign any special significance to
one or more repositories.
- Every file required by the version control system is stored in a
single hidden directory (.hg) at the top level of the repository. This is
the single point of file namespace pollution for using the system in a
project.
That's probably enough to give the basic idea.
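To make the "change set number" idea a bit more concrete, here is a minimal Python sketch of how an ID can cover both the files in a revision and the history behind it. The function name, the manifest layout, and the order in which things are hashed are all assumptions made for illustration; this is not Mercurial's actual on-disk format.

```python
import hashlib


def changeset_id(parent_ids, manifest, message):
    """Hash the parents' IDs together with this revision's content.

    Hypothetical layout for illustration only; Mercurial's real
    format differs.
    """
    h = hashlib.sha1()
    # Including the parent IDs ties the ID to the whole history.
    # Sorting makes the result independent of argument order.
    for parent in sorted(parent_ids):
        h.update(parent.encode("ascii"))
    # manifest maps file names to file contents (bytes) for this revision.
    for name in sorted(manifest):
        h.update(name.encode("utf-8"))
        h.update(hashlib.sha1(manifest[name]).digest())
    h.update(message.encode("utf-8"))
    return h.hexdigest()  # 40 hexadecimal digits


root = changeset_id([], {"hello.py": b"print('hi')\n"}, "first commit")
child = changeset_id([root], {"hello.py": b"print('hello')\n"}, "tweak greeting")
# Same files and message, but a different parent: the ID changes.
forged = changeset_id(["0" * 40], {"hello.py": b"print('hello')\n"}, "tweak greeting")

print(root[:12], child[:12], forged[:12])
```

Because the parent IDs feed into the hash, two revisions with identical files but different histories get different IDs, which is what lets a single number vouch for everything that came before it.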
*User Interface*
As far as the interface goes, you type "hg init" at the top level of a
project directory containing your source files to create a boilerplate
"blank-slate" .hg directory. You type "hg status" to see which new files
Mercurial sees in your directory (all files show up as new files). You
create a .hgignore file at the top level and add glob- or regex-style
patterns to tell Mercurial about the files that it should not track (object
files, backup files, etc.). You type "hg status" to confirm that the set of
files Mercurial sees are the ones you want to add to the project and then
"hg add" to add them all (you can also add a subset with "hg add
<filenames>"). Finally, you do "hg commit" ("hg ci" is equivalent). At
this point you have your first revision and a "hg status" should show
nothing (or whatever files you chose not to add if you didn't add them all
for the first commit). Typing "hg clone <hg_dir>" in a completely separate
directory will create a copy (branch) of your project there. "hg clone
<url>" or "hg clone ssh://user@host:22/<hg_dir>" will create a branch of a
network-visible project. "hg pull [<repository>]" will copy/download any
new revisions from the named repository, which defaults to the one you
branched from. "hg update" is required to bring the state of your sandbox
in sync with the downloaded revisions. An "hg merge" may be needed if there
are conflicts. "hg push" will push changes back up to the place your
directory was cloned from (again, an "hg update" and "hg commit" are
required on the other end to make those changes official).
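As a rough illustration of how glob- and regex-style ignore patterns narrow down what "hg status" reports, here is a small Python sketch. The pattern lists and function names are made up for the example, and the matching rules are a simplification, not Mercurial's actual matcher.

```python
import fnmatch
import re

# Hypothetical ignore rules, split by style the way the text describes.
GLOB_PATTERNS = ["*.o", "*.pyc", "*~"]
REGEX_PATTERNS = [r"^build/", r"\.orig$"]


def is_ignored(path):
    """Return True if path matches any glob- or regex-style pattern."""
    # In this sketch, globs are tested against the file name only.
    name = path.rsplit("/", 1)[-1]
    if any(fnmatch.fnmatchcase(name, pat) for pat in GLOB_PATTERNS):
        return True
    # Regexes are searched against the whole relative path.
    return any(re.search(pat, path) for pat in REGEX_PATTERNS)


def untracked(paths):
    """Files a status listing would still report after filtering."""
    return [p for p in paths if not is_ignored(p)]


files = [
    "main.c",
    "main.o",
    "build/out.bin",
    "notes.txt~",
    "lib/util.py",
    "lib/util.py.orig",
]
print(untracked(files))
```

Object files, editor backups, and everything under build/ drop out of the listing, leaving only the files you would actually want to add.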
Mercurial also has a built-in web server that can be started if you want
people on a shared network to be able to browse your repository that way.
In addition, it comes with a graphical tool (hgk) which allows you to see
the history of a project including branches and merges. I'm fond of using
tkdiff to diff my sandbox with a repository so I hacked a copy of tkdiff to
do this (perhaps by now the official one supports it as well).
One more cool feature I have to mention: Mercurial has a "bisect" command
that you can use for finding when a bug was introduced. You start bisect,
which chooses a revision of the code for you. You run your regression test
and run bisect again, telling it whether your test succeeded or failed.
This chooses a new revision of the code using a binary search. In a small
number of iterations, you find the change that broke the code. Clearly,
with an automated test this is easy to automate. I haven't used the command
yet, but am looking forward to it. I think we should create something
similar for use with xcs, since this automates a useful process that many
people find prohibitively tedious.
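The search that bisect performs can be sketched in a few lines of Python. This is only an illustration of the binary search idea under simplifying assumptions: the history is linear, the first revision is known good, the last is known bad, and every revision after the breaking change also fails. The function name and the stand-in test are hypothetical, not part of Mercurial.

```python
def find_first_bad(revisions, is_good):
    """Return (first failing revision, number of tests run).

    Assumes revisions are in history order, revisions[0] passes,
    revisions[-1] fails, and failures persist once introduced.
    """
    lo, hi = 0, len(revisions) - 1  # lo: known good, hi: known bad
    tests_run = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests_run += 1
        if is_good(revisions[mid]):
            lo = mid  # the bug was introduced after mid
        else:
            hi = mid  # the bug was introduced at or before mid
    return revisions[hi], tests_run


# Stand-in history: revisions 0..99, with the bug introduced in 37.
revisions = list(range(100))
first_bad, tests_run = find_first_bad(revisions, lambda rev: rev < 37)
print(first_bad, tests_run)
```

With 100 revisions the loop needs at most 7 test runs, which is why plugging in an automated regression test makes the whole process cheap.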
*Mercurial Weaknesses*
One weakness of Mercurial is that it does not have support for storing
multiple projects (i.e. you might want to selectively check out a single
project without checking out the rest) in the same repository. If your
project source includes large, independent subsystems and projects (a
situation that I haven't dealt with yet in my own use of hg), it sounds like
the way to handle this is to use relative symbolic links in the separate
projects. But I've only read some references to doing this and I don't
completely understand it yet.
*Other Distributed Version Control Tools*
There's another system, called Git, that is probably just as good as
Mercurial. Git was created by Linus Torvalds and is based on almost the
same set of ideas as Mercurial. Git is quite a bit faster for many
operations and uses a little less disk space, but last I checked it still
had bad Windows support and Mercurial had better documentation. Otherwise,
they seem pretty similar in terms of robustness and features, although I
found git to be a bit more confusing on first glance because of the large
number of additional "plumbing" commands that it makes available. Recently I
took a second look and got the impression that there's a subset of the git
commands that are almost identical to the hg commands. Although I
originally tried git first, I later tried Mercurial and (perhaps because of
the better documentation) never went back. I was also impressed by the
fact that the original Mercurial source code was just a couple thousand
lines of pure Python and was (at the time) within a factor of 2 of the speed
of git. I appreciate the engineering required to write something in an
elegant, concise way and still have it perform within an order of magnitude
of C code written by a master of ultra-efficient OS-level C. The Mercurial
source is larger now, and includes 3 small (<500 lines each) C files to
speed up diffs, patch files, and some other low-level feature, but is still
pretty small (~20KLOC vs. ~90KLOC for Git). In any case, both tools were so
many times faster than any version control I'd used in the past that I
really didn't care about that last percent of speed that git might give me.
The tools have shared a lot of ideas and even some code (e.g. hgk is derived
from a Git gui called gitk, git has copied some of hg's features) and I
expect this to continue since they're both developed by active members of
the Linux kernel community.
The rest of the distributed version control systems I know of are: bzr
("Bazaar", written in Python), monotone, GNU Arch, and darcs (written in
Haskell). Darcs is supposed to be conceptually different from the others
but I really don't know much about it. All of these others had
significantly lower performance than git or hg last I checked, but things
are changing fast. Bzr is supposed to have a slightly easier command line
interface, but I find Mercurial's to be pretty easy already. There are
tools for migrating projects from each of these to any of the others.
SVK is a tool for use with SVN that lets you have a local repository. I
think it's kind of like an svn repository on one side (the user side) and
an svn client on the other (the side that talks to the real SVN server). A
good friend uses it and recommends it for SVN users who want to be able to
do "offline" publishes. My understanding, however, is that it doesn't
provide any of the other features of the tools above.
*Migrating to Mercurial*
There are a few tools for migrating projects from SVN and CVS. The one
called "hgsvn" seemed like the best for SVN last time I checked. The
original import is kind of slow, though, since it has to do something like
check out each revision from the SVN server.
Some people seem to feel productive using hg locally while publishing to a
cvs or svn server. I'm not sure how that works though.
*Links*
Official Page - http://www.selenic.com/mercurial/wiki/
Tutorial - http://www.selenic.com/mercurial/wiki/index.cgi/Tutorial
Book - http://hgbook.red-bean.com/hgbook.html
Google Tech Talk about Hg: http://www.youtube.com/watch?v=JExtkqzEoHY
Linus' egotistical talk about Git:
http://www.youtube.com/watch?v=4XpnKHJAok8
Randal Schwartz' talk about Git: http://www.youtube.com/watch?v=8dhZ9BXQgc4
Performance Benchmarks (of varying quality and age) -
http://weblogs.mozillazine.org/jst/archives/2006/11/vcs_performance.html
http://weblogs.mozillazine.org/jst/archives/2007/02/bzr_and_different_network_prot.html
http://git.or.cz/gitwiki/GitBenchmarks
https://lists.ubuntu.com/archives/bazaar/2006q2/011953.html
-Damon