difflib and analyzing filetrees

Terry Hancock hancock at anansispaceworks.com
Tue Aug 27 12:48:36 CEST 2002

Hi all,

I was very intrigued when I discovered the difflib and,
later, the xmldiff modules for Python. This raises an
obvious question, since in Unix, diff led to patch,
which led to RCS and eventually CVS.

CVS is kind of decrepit after many years of being extended
and implemented in a somewhat low-level language. This
restricts it in certain unpleasant ways -- for example
it still has a rather poor notion of the relationships
between files, structural changes in a source tree, etc.
It is also welded pretty solidly to conventional file
systems, and provides its own authentication and networking
instead of being (easily) embeddable in other systems.

In a project I'm working on, I need some version control
software, and I don't think wrapping CVS is the right
approach.  I don't really need 100% CVS functionality,
and the nature of CVS's implementation kind of gets in
the way (e.g. my "files" aren't really files but Zope
objects). Also, I want to implement some ideas like
parallel versions of objects (e.g. translations).

I've been thinking about how to tackle the problem of
analyzing two source trees representing changes including
the usual in-file differences, but also:

new directories
deleted directories
moved files
split files
merged files

which are not formally recognized by CVS (I know that, in
principle, they can be reduced to add, remove, and update
operations, but I think it would be more efficient to
track these kinds of changes following the way they
actually (or probably) happened -- this would become more
important for trees which experienced many structural

(I could rely on Zope to tell me what actually happened,
since it records that information, but I want to be
able to analyze the two and make a plausible guess at
what happened -- not necessarily perfectly optimal, but
which would make sense to a human reader, sort of like
how difflib itself works. Also I don't really want to
limit the application strictly to Zope, but would
prefer to keep the tree abstract, as is the input source
for difflib).
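
A minimal sketch of the kind of tree comparison described above. It
is not an existing module. It assumes the tree is abstracted as a
plain dict mapping a path to the object's text, so directories are
implied by the paths. The name compare_trees and the 0.6 similarity
threshold are hypothetical choices for illustration. It guesses at
moves by pairing each vanished path with the most similar new one
using difflib.SequenceMatcher.ratio(); split and merged files are
not handled here.

```python
import difflib


def compare_trees(old, new, threshold=0.6):
    """Make a plausible guess at what changed between two trees.

    old, new: dicts mapping a path string to that object's text
    (an assumed abstraction; nothing here is specific to Zope or files).
    Returns a list of (kind, old_path, new_path) tuples.
    """
    changes = []
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))

    # Paths present in both trees were either untouched or edited in place.
    for path in sorted(set(old) & set(new)):
        if old[path] != new[path]:
            changes.append(("modified", path, path))

    # Pair each vanished path with the most similar new path, if any
    # candidate is at least as similar as the threshold.
    for src in removed:
        best, best_ratio = None, threshold
        for dst in added:
            ratio = difflib.SequenceMatcher(None, old[src], new[dst]).ratio()
            if ratio >= best_ratio:
                best, best_ratio = dst, ratio
        if best is None:
            changes.append(("deleted", src, None))
        else:
            changes.append(("moved", src, best))
            added.remove(best)  # each new path explains at most one move

    # Whatever is left over is genuinely new.
    for dst in added:
        changes.append(("added", None, dst))
    return changes
```

The greedy pairing is not necessarily optimal, but it tends to give
an answer a human reader would accept, which is the goal stated above.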

I think I have a take on how to do this, but I was
wondering whether someone's already done it.

Also, is there a "patch" for difflib? -- it seems like
it would be pretty simple to implement given the formats
that difflib can return.
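
For what it's worth, difflib.restore() can rebuild either input from
an ndiff() delta, but that delta carries both sequences in full. A
rough sketch of a more patch-like approach, built on
SequenceMatcher.get_opcodes(), might look like the following. The
names make_delta and apply_delta and the delta format are hypothetical,
not part of difflib.

```python
import difflib


def make_delta(old_lines, new_lines):
    """Record only what is needed to turn old_lines into new_lines.

    Returns a list of (start, end, replacement_lines) tuples, where
    start:end is a slice of old_lines (a made-up format for illustration).
    """
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # 'replace', 'delete' and 'insert' all reduce to a slice assignment.
            delta.append((i1, i2, new_lines[j1:j2]))
    return delta


def apply_delta(old_lines, delta):
    """Apply a delta from make_delta() to a copy of old_lines."""
    result = list(old_lines)
    # Work from the end so that earlier slice indices stay valid.
    for i1, i2, replacement in reversed(delta):
        result[i1:i2] = replacement
    return result
```

Unlike the Unix patch, this sketch stores no context lines, so it can
only be applied to exactly the sequence it was computed from.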

I realize that reimplementing so much functionality
is kind of a heavy choice, but I think writing a wrapper
would most likely wind up being just as heavy. And I'm
not sure it really has to be that big to do the job
(no doubt I'll skimp on some of the fancier features).

Anyway, a few searches on Google didn't turn up any
modules like what I'm looking for (there's xmldiff,
the various members of the wikiwiki family, and ViewCVS,
but none seems to really be what I'm talking about). So,
I think it's time to ask if someone out there is sitting
on something like this.

Alternatively, assuming that I'm really going to
write this, does anyone want to discuss or persuade me
to add some feature or other?  What should the API look
like?  I haven't actually written anything down yet,
so I'm pretty open to new ideas.

Any input appreciated. Thanks,

Terry Hancock
hancock at anansispaceworks.com       
Anansi Spaceworks                 
P.O. Box 60583                     
Pasadena, CA 91116-6583
