Python patch analog

Terry Hancock hancock at anansispaceworks.com
Wed Jan 8 16:51:16 EST 2003


I am working on a library class for Python which implements version control 
for an object. It relies on external persistence (probably ZODB in 
practice, but I don't provide persistence internally).

It works on generic sequences of comparable data (all I really require is 
that '==' works and is meaningful, and that the sequence works with 
Python's existing difflib -- which does, in fact, require the elements to 
be hashable).  In the
future, I will probably add the ability to use recursive dictionaries of 
sequences (i.e. like directories), but right now it just works with 
sequences -- or rather data that can be marshalled in and out of sequences 
by well-defined, reversible, object-type-specific split() and join() 
operations.
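
For instance, a plain-text object type might marshal like this (just an 
illustrative sketch; the class and method names are made up):

class TextMarshaller:
    """Reversible marshalling between a text object and a sequence."""

    def split(self, text):
        # Marshal the object into a sequence of comparable elements
        # (lines, keeping their endings so that join() is exact).
        return text.splitlines(True)

    def join(self, lines):
        # The exact inverse of split(): reassemble the original object.
        return "".join(lines)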

At first, I thought that I needed to emulate the Unix patch utility and 
then CVS ("ontogeny follows phylogeny"?).  So I was trying to do "context 
patching" correctly, which I can't do with regular expressions, because we 
are NOT (necessarily) talking about strings of characters anymore (you 
could argue that I'd have to reimplement some of RE to act on non-string 
sequences).

Difflib can generate a description of the differences which can be 
interpreted as a program for patching the sequence into the newly desired 
state.  Implementing a "patch" based on this is pretty simple.  But it is 
somewhat brittle, as it takes no account of the order in which patches are 
applied, or of the possibility that the underlying sequence has changed in 
some unpredicted way.
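
To make that concrete, here is roughly what I mean (an untested sketch; 
make_patch() and apply_patch() are my own illustrative names, not difflib 
API):

import difflib

def make_patch(old, new):
    # Each opcode is (tag, i1, i2, j1, j2): replace old[i1:i2] with
    # new[j1:j2], where tag is 'equal', 'replace', 'delete' or 'insert'.
    matcher = difflib.SequenceMatcher(None, old, new)
    return [(tag, i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]

def apply_patch(old, patch):
    result = []
    for tag, i1, i2, replacement in patch:
        if tag == 'equal':
            result.extend(old[i1:i2])
        else:                # 'replace', 'delete' or 'insert'
            result.extend(replacement)
    return result

old = list('the quick brown fox')
new = list('the quick red fox')
assert apply_patch(old, make_patch(old, new)) == new

The brittleness is plain here: every i1 and i2 is an absolute index into 
the exact base sequence the diff was computed against.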

The original "patch" uses context to sort out where a patch actually goes, 
because the file may have been changed by other means between when the diff 
was computed and when it is to be applied as a patch.  CVS is apparently 
based on this approach as well, although I haven't checked its sources.

For clarity, and for lack of "correct" terminology, I will refer to this 
context-based approach as "context patching" and the purely numerical 
approach of raw difflib (plus the changes I describe below) as "precision 
patching".  If you know better/more-accepted terms, please substitute and 
enlighten me, so I can make my documentation easier to understand.

But I discovered something interesting (and to me surprising) -- it seems 
to me that I don't need context patches at all, because I am keeping 
records on the object state.  Let me elaborate:

First, for simplicity, consider the case where only one person works on the 
object at a time (i.e. no concurrent check-outs).

All we have to do is keep track of the patches that have been submitted, 
along with the date and time of each submission.  As we don't allow 
concurrency, each patch is always against the most recent version of the 
file.
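
In code, the record could be as simple as this (again just a sketch; 
PatchLog is a made-up name, and apply_patch() is the illustrative function 
from the earlier sketch):

import time

class PatchLog:
    def __init__(self):
        self.entries = []    # chronological list of (timestamp, patch)

    def submit(self, patch):
        # With no concurrency, every patch is against the result of
        # all previously logged patches, so order alone is enough.
        self.entries.append((time.time(), patch))

    def reconstruct(self, base):
        # Replay the whole history over the original base sequence.
        seq = list(base)
        for timestamp, patch in self.entries:
            seq = apply_patch(seq, patch)
        return seq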

But suppose one patch is in error, so we choose to remove it. Won't that 
make the whole house of cards fall, because every patch after that was 
computed assuming that the missing patch had acted?

Well, yes, but it turns out to be very easy to compute the effect of not 
including the patch, and to shift all subsequent patch instructions 
numerically so that they work as expected.  In the case of non-overlapping 
patches, the effect is a simple "position and delta" adjustment which 
"pushes" or "pulls" the remainder of the sequence by the difference in 
length between the removed and substituted elements; for overlapping 
patches it's stickier, but there's still a reasonable solution.  So, as 
long as I actually have the missing patch to analyze, I can compute this 
-- i.e. I don't "throw away" a bad patch, but "ignore" it, precisely 
computing the resulting effect.
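
For the non-overlapping case, the arithmetic is just this (a sketch; both 
function names are mine, and the patch format is the (tag, i1, i2, 
replacement) form from the earlier sketch):

def position_deltas(patch):
    # For each non-equal hunk: (position in the patched output, net
    # change in length that the hunk introduced at that position).
    deltas, offset = [], 0
    for tag, i1, i2, replacement in patch:
        if tag != 'equal':
            d = len(replacement) - (i2 - i1)
            deltas.append((i1 + offset, d))
            offset = offset + d
    return deltas

def shift_for_ignored(later_patch, ignored_patch):
    # "Pull" or "push" a later patch's indices to undo the positional
    # effect of a patch we have decided to ignore.  Only correct when
    # the patches don't overlap, as discussed above.
    deltas = position_deltas(ignored_patch)
    shifted = []
    for tag, i1, i2, replacement in later_patch:
        shift = 0
        for pos, d in deltas:
            if pos <= i1:    # the ignored change landed before this hunk
                shift = shift + d
        shifted.append((tag, i1 - shift, i2 - shift, replacement))
    return shifted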

Okay, so, with no concurrency, we have no problem.  Now let's see what we 
can do with concurrent check-ins.

How I interpret "concurrent check-ins" is that, essentially, we remove 
the guarantee that the submitted content is diffed against the most recent 
version of the sequence.  Instead, it is diffed against a specific, known, 
previous version.

This means that the sequence has changed on us. So, now, do we need context?

No.  Because we can use the inverse of the "ignore" code to compute the 
positional effect of *acknowledging* all changes that occurred **after the 
checkout, but before the check-in**.  This can be used to precisely shift 
the to-be-patched submission before actually applying it.

This works even if there's more than one check-in coming from different 
sources, so long as each source can be identified as working from a 
specific check-out time.  (You just apply each one in the order of 
submission, applying the rule in the previous paragraph to each before 
applying it.)
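
In the same sketch form, the inverse shift might look like this 
(rebase_patch is again a made-up name, and again this only handles the 
non-overlapping case):

def rebase_patch(incoming, intervening_patches):
    # Shift an incoming patch (diffed against an older checkout)
    # forward across every patch logged since that checkout, taking
    # the intervening patches in order of submission.
    rebased = list(incoming)
    for patch in intervening_patches:
        # Deltas keyed by position in the pre-patch coordinates --
        # the coordinate system the not-yet-shifted hunks still use.
        deltas = [(i1, len(replacement) - (i2 - i1))
                  for tag, i1, i2, replacement in patch
                  if tag != 'equal']
        shifted = []
        for tag, i1, i2, replacement in rebased:
            shift = 0
            for pos, d in deltas:
                if pos <= i1:    # acknowledge this intervening change
                    shift = shift + d
            shifted.append((tag, i1 + shift, i2 + shift, replacement))
        rebased = shifted
    return rebased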

I include all of these numerical-shifting behaviors into the idea of 
"precision patching" as opposed to "context patching".

I believe it is still possible to identify occasional conflicts with this 
mechanism, and that the same conceptual approach that CVS uses can be 
applied to them (or perhaps it needs more thought, but in any case, it is 
outside the scope of this problem).  The important point is that conflicts 
shouldn't happen any more often this way than with context diffs (and 
possibly, they will occur much less often, depending on the data).

But now I can't think of any reason why I would need to use context 
patching.  Is there a use case I'm missing here?

Now "why on Earth would I want to ditch tradition and do it this way?", you 
might ask.  Well, aside from a certain elegance (I might even say 
"pythonic-ness" ;-D), the reason is that context patching can get into 
trouble if the context changes too rapidly or if there are too many changes 
to a file going on (so that conflicts happen often). Also, implementing the 
context checking is tricky, and probably error-prone intrinsically (whereas 
precision patching is basically just simple arithmetic).

Empirically, the existing Unix patch does a good job on program text data, 
and it's possible that it would do fine in general, though I don't know of 
any kind of proof of this (which, of course, doesn't mean there isn't 
one; I could just be ignorant).  Anyway, the point is that we're talking now
about *generic sequence data*, not lines of a file, so some assumptions may 
fall flat.  Context may not be particularly meaningful in a list of 
graphical drawing objects, or some other binary data that I'm not familiar 
with.

So, I think that this "precision" approach may be a better bet practically, 
as well as being simpler.

But, I have to confess that the design is probably more brittle. I just 
can't think of any way to break it in this application.  Any dissenting 
opinions out there?  Am I being thick, or have I really found a simpler way 
to do versioning?

Cheers,
Terry

-- 
Anansi Spaceworks
http://www.anansispaceworks.com



