Unix diff algorithm in Python anyone?
Tim Peters
tim_one at email.msn.com
Wed Oct 6 01:02:53 EDT 1999
[Janko Hauser points out the vastly underused <wink>
Tools/scripts/ndiff.py]
> It does not work with patch, but can be used to patch files with its
> own output.
It's really got nothing to do with any kind of patching. From ndiff's
output, either of the input files can be recovered exactly, but that's all.
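
As a rough illustration of that round trip: the sketch below assumes the
standard-library difflib module, which postdates this message and packages
the ndiff.py machinery. difflib.ndiff and difflib.restore are that module's
names, not names used in the post, and the sample lines are made up.

```python
import difflib

# Two made-up revisions of a tiny text file (lists of lines).
old = ["one\n", "two\n", "three\n"]
new = ["one\n", "too\n", "three\n", "four\n"]

# An ndiff-style delta: every line is prefixed with "  ", "- ", "+ " or "? ".
delta = list(difflib.ndiff(old, new))

# Either input can be rebuilt from the delta alone:
# which=1 recovers the first sequence, which=2 the second.
recovered_old = list(difflib.restore(delta, 1))
recovered_new = list(difflib.restore(delta, 2))

print(recovered_old == old, recovered_new == new)
```

Nothing here produces or consumes patch-style input; the delta is only a
self-contained record of both files.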
I wrote ndiff.py over a number of years, to address a specific need: when
passing out new revisions of plain text specs, there's a real need to
annotate what changed. "Minimal edit distance" tools like diff do a bad job
at this, for several reasons:
1. They point out inter-line differences, but not intra-line. If e.g. a "2"
changes to a "3" in some line of a spec, you want that clearly marked. It's
not enough to print the two lines and say "OK, there's *some* difference
here, and *you* guess how many and where".
2. Their mechanical zeal for finding minimal diffs leads to
counter-intuitive synchronization. There are a few examples of that in
ndiff's comments. ndiff's output is meant to be read by humans, so strives
to "synch up" in the same way a human editor would do it.
3. Relatedly, minimal edits synch up on "junk" (typically blank lines or
whitespace) if that leads to a shorter overall diff. But people have no
interest in junk that happens to match, and find it much easier to read a
diff report that synchs up only on "real stuff" -- even if that makes the
diff longer.
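
Points 1 and 3 can be sketched in a few lines, again assuming the later
difflib module rather than ndiff.py itself. The sample lines and the
is_blank helper are hypothetical, chosen only to make the effect visible.

```python
import difflib

# Point 1: intra-line marking. A single changed character gets its own
# "? " guide line pointing at the position that differs.
old = ["The timeout is 2 seconds.\n"]
new = ["The timeout is 3 seconds.\n"]
delta = list(difflib.ndiff(old, new))
print("".join(delta), end="")

# Point 3: junk. The only thing these two files share is a blank line.
def is_blank(line):
    """Hypothetical junk test: treat whitespace-only lines as junk."""
    return line.strip() == ""

a = ["x\n", "\n", "y\n"]
b = ["p\n", "\n", "q\n"]

# With no junk predicate, the matcher synchs up on the blank line.
plain = difflib.SequenceMatcher(None, a, b, autojunk=False).get_opcodes()

# With blank lines declared junk, it declines to anchor on them.
no_junk_sync = difflib.SequenceMatcher(is_blank, a, b, autojunk=False).get_opcodes()

print(plain)
print(no_junk_sync)
```

In the second half, the plain matcher reports replace/equal/replace around
the blank line, while the junk-aware one reports a single three-line
replace: a longer diff, but one that does not pretend the blank line is a
meaningful point of agreement.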
That said, ndiff.py would be easy to twist into producing diff-like output:
keep the SequenceMatcher class, and throw away most of the rest of the
module. SequenceMatcher computes a list of {equal, replace, insert, delete}
"opcodes", and it's straightforward to convert that into any kind of output
you want. Set the IS_LINE_JUNK variable to "lambda x: 0", and it will produce
smaller diffs, more like diff's, but with no claim to being minimal.
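
A minimal sketch of that conversion, assuming the SequenceMatcher that later
shipped in the standard-library difflib module (not the 1999 ndiff.py
source). The diff_like function and its output layout are hypothetical; the
"< " and "> " prefixes only imitate diff's look and are not patch-compatible.
Passing isjunk=None plays the role of the "lambda x: 0" setting, since no
line is then treated as junk.

```python
import difflib

def diff_like(a, b, isjunk=None):
    """Render SequenceMatcher opcodes as rough diff-style lines.

    a and b are lists of lines without trailing newlines.
    """
    matcher = difflib.SequenceMatcher(isjunk, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged runs are not reported
        # One header per changed run, then the lines from each side.
        out.append(f"{tag} a[{i1}:{i2}] b[{j1}:{j2}]")
        out.extend("< " + line for line in a[i1:i2])  # empty for "insert"
        out.extend("> " + line for line in b[j1:j2])  # empty for "delete"
    return out

old = ["alpha", "beta", "gamma"]
new = ["alpha", "BETA", "gamma", "delta"]
report = diff_like(old, new)
print("\n".join(report))
```

The loop is the whole trick: each opcode names a slice of each input, so any
output format is a matter of how those slices get printed.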
otoh-if-you-want-"diff"-output-there's-a-reasonable-case-to-be-
made-for-using-diff<wink>-ly y'rs - tim