Python patch analog
Terry Hancock
hancock at anansispaceworks.com
Wed Jan 8 16:51:16 EST 2003
I am working on a library class for Python which implements version control
for an object. It relies on external persistence (probably ZODB in
practice, but I don't provide persistence internally).
It works on generic sequences of comparable data (I think this means
hashable, but all I really require is that '==' works and is meaningful,
and that the sequence works with Python's existing difflib). In the
future, I will probably add the ability to use recursive dictionaries of
sequences (i.e. like directories), but right now it just works with
sequences -- or rather data that can be marshalled in and out of sequences
by well-defined, reversible, object-type-specific split() and join()
operations.
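For concreteness, here is a minimal sketch of one such reversible split/join pair for plain text (my own illustration, not code from the library): splitting into lines while keeping the line endings guarantees that joining reproduces the original exactly.

```python
def split_text(s):
    """Split text into a sequence of lines, keeping the line endings
    so that nothing is lost in the round trip."""
    return s.splitlines(keepends=True)

def join_text(lines):
    """Inverse of split_text: join_text(split_text(s)) == s."""
    return ''.join(lines)
```

Any pair of operations with that round-trip property would do; lines are just the obvious choice for text.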
At first, I thought that I needed to emulate the Unix patch utility and
then CVS ("ontogeny recapitulates phylogeny"?). So I was trying to do "context
patching" correctly, which I can't do using regular expressions, because we
are NOT (necessarily) talking about strings of characters anymore (you
could argue that I would have to reimplement some of the re module to act on
non-string sequences).
Difflib can generate a sequence description which can be interpreted as a
program to patch the sequence to the newly desired state. Implementing a
"patch" based on this is pretty simple. But it is somewhat brittle: it
does not account for the order in which patches are applied, nor for the
possibility that the underlying sequence has changed in some unpredicted way.
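Here is a rough sketch of what I mean (make_patch and apply_patch are just names I am inventing for illustration): difflib's opcodes can be recorded, together with the replacement data, and replayed later against a sequence that is assumed unchanged.

```python
import difflib

def make_patch(old, new):
    """Record the non-'equal' difflib opcodes, plus the replacement
    data, as a replayable patch."""
    sm = difflib.SequenceMatcher(None, old, new)
    return [(tag, i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != 'equal']

def apply_patch(seq, patch):
    """Replay a patch against a sequence assumed to be unchanged since
    the diff was computed -- no context checks, pure index arithmetic."""
    result = list(seq)
    # Opcodes come in ascending order of position; apply from the end
    # so earlier indices remain valid as the sequence grows or shrinks.
    for tag, i1, i2, data in reversed(patch):
        result[i1:i2] = data
    return result
```

Each op is a (tag, i1, i2, replacement_data) tuple, so the patch is a tiny "program" in exactly the sense above.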
The original "patch" uses context to sort out where a patch actually goes,
because the file may have been changed by other means between when the diff
was computed and when it is to be applied as a patch. CVS is apparently
based on this approach as well, although I haven't checked its sources.
For clarity, and lack of "correct" terminology, I will refer to this
context based approach as "context patching" and the purely numerical
approach of raw difflib (plus changes I will add below) as "precision
patching". If you know better/more-accepted terms, please substitute and
enlighten me, so I can make my documentation easier to understand.
But I discovered something interesting (and to me surprising) -- it seems
to me that I don't need context patches at all, because I am keeping
records on the object state. Let me elaborate:
First, for simplicity, consider the case where only one person works on the
object at a time (i.e. no concurrent check-outs).
All we have to do is keep track of what patches have been submitted along
with the date and time they were submitted. As we don't allow concurrency,
the patch is always against the most recent version of the file.
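A toy sketch of that bookkeeping (PatchLog is a hypothetical name, not my actual class): every commit is diffed against the most recent reconstructed version and stored with its timestamp.

```python
import difflib

class PatchLog:
    """Hypothetical sketch of the no-concurrency case: an append-only
    record of (timestamp, patch) entries, each diffed against the most
    recent version of the sequence."""

    def __init__(self, initial):
        self.initial = list(initial)
        self.history = []            # list of (timestamp, patch) entries

    def commit(self, new, timestamp):
        old, new = self.current(), list(new)
        sm = difflib.SequenceMatcher(None, old, new)
        patch = [(tag, i1, i2, new[j1:j2])
                 for tag, i1, i2, j1, j2 in sm.get_opcodes()
                 if tag != 'equal']
        self.history.append((timestamp, patch))

    def current(self):
        """Reconstruct the latest version by replaying every patch."""
        seq = list(self.initial)
        for _stamp, patch in self.history:
            # Apply each patch's ops from the end so indices stay valid.
            for tag, i1, i2, data in reversed(patch):
                seq[i1:i2] = data
        return seq
```

Persistence is left external, as described above; this only shows the patch-log idea.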
But suppose one patch is in error, so we choose to remove it. Won't that
make the whole house of cards fall, because every patch after that was
computed assuming that the missing patch had acted?
Well, yes, but it turns out to be very easy to compute the effect of not
including the patch, and numerically shift all subsequent patch
instructions to work as expected (in the case of non-overlapping patches,
the effect is a simple "position and delta" effect which "pushes" or
"pulls" the remainder of the file by the difference between the removed and
substituted elements of the sequence -- for overlapping patches it's
stickier, but there's still a reasonable solution). So, as long as I
actually have the missing patch to analyze, I can compute this -- i.e. I
don't "throw away" a bad patch, but "ignore" it, precisely computing the
resulting effect.
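Here is a sketch of that shift for the non-overlapping case (hypothetical helper names; ops are (tag, i1, i2, replacement_data) tuples like those derived from difflib's get_opcodes()):

```python
def op_delta(op):
    """Net length change produced by one (tag, i1, i2, data) patch op."""
    tag, i1, i2, data = op
    return len(data) - (i2 - i1)

def shift_for_ignored(patch, ignored):
    """Shift a later patch's indices to undo the positional effect of
    an earlier patch that we have decided to ignore (non-overlapping
    case only)."""
    shifted = []
    for tag, i1, i2, data in patch:
        # Every ignored op acting at or before this position pushed (or
        # pulled) the rest of the sequence by its delta; undo that.
        delta = sum(op_delta(op) for op in ignored if op[1] <= i1)
        shifted.append((tag, i1 - delta, i2 - delta, data))
    return shifted
```

For example, ignoring a two-element insertion pulls every later op's indices back by two.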
Okay, so, with no concurrency, we have no problem. Now let's see what we
can do with concurrent check-ins.
I interpret "concurrent check-ins" to mean, essentially, that we remove the
guarantee that the submitted content is diffed against the most
recent version of the sequence. Instead, it is diffed against a specific,
known, previous version.
This means that the sequence has changed on us. So, now, do we need context?
No. Because we can use the inverse of the "ignore" code to compute the
positional effect of *acknowledging* all changes that occurred **after the
checkout, but before the check in**. This can be used to precisely shift
the to-be-patched submission before actually applying it.
This works even if there is more than one check-in coming from different
sources, so long as each source can be identified as working from a
specific check-out time. (You just process the check-ins in order of
submission, applying the rule from the previous paragraph to each one
first.)
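A sketch of that inverse shift, again for the non-overlapping case (hypothetical name; ops are (tag, i1, i2, replacement_data) tuples like those derived from difflib's get_opcodes()):

```python
def acknowledge_intervening(patch, intervening):
    """Shift a patch that was diffed against an older check-out so that
    its indices land correctly on the current sequence, after the
    intervening patches have already been applied (non-overlapping
    case). This is the inverse of the shift used to ignore a patch."""
    shifted = []
    for tag, i1, i2, data in patch:
        # Each intervening op acting at or before this position moved
        # the rest of the sequence by its net length change; follow it.
        delta = sum(len(d) - (b - a)
                    for _tag, a, b, d in intervening if a <= i1)
        shifted.append((tag, i1 + delta, i2 + delta, data))
    return shifted
```

So a stale patch is "brought forward" by simple arithmetic before being applied.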
I include all of these numerical-shifting behaviors into the idea of
"precision patching" as opposed to "context patching".
I believe it is still possible to identify occasional conflicts with this
mechanism, and that the same conceptual approach that CVS uses can be
applied to them (or perhaps it needs more thought, but in any case, is
outside the scope of this problem). The important point is that conflicts
shouldn't happen any more often this way than with context diffs (and
possibly they will occur much less often, depending on the data).
But now I can't think of any reason why I would need to use context
patching. Is there a use case I'm missing here?
Now "why on Earth would I want to ditch tradition and do it this way?", you
might ask. Well, aside from a certain elegance (I might even say
"pythonic-ness" ;-D), the reason is that context patching can get into
trouble if the context changes too rapidly or if there are too many changes
to a file going on (so that conflicts happen often). Also, implementing the
context checking is tricky and probably intrinsically error-prone (whereas
precision patching is basically just simple arithmetic).
Empirically, the existing Unix patch does a good job on program text data,
and it's possible that it would do fine in general, though I don't know of
any kind of proof of this (which, of course, doesn't mean there isn't one,
I could just be ignorant). Anyway, the point is that we're talking now
about *generic sequence data*, not lines of a file, so some assumptions may
fall flat. Context may not be particularly meaningful in a list of
graphical drawing objects, or some other binary data that I'm not familiar
with.
So, I think that this "precision" approach may be a better bet practically,
as well as being simpler.
But, I have to confess that the design is probably more brittle. I just
can't think of any way to break it in this application. Any dissenting
opinions out there? Am I being thick, or have I really found a simpler way
to do versioning?
Cheers,
Terry
--
Anansi Spaceworks
http://www.anansispaceworks.com