Unix diff algorithm in Python anyone?

Tim Peters tim_one at email.msn.com
Wed Oct 6 01:02:53 EDT 1999


[Janko Hauser, points out the vastly underused <wink>
    Tools/scripts/ndiff.py
]
> It does not work with patch, but can be used to patch files with its
> own output.

It's really got nothing to do with any kind of patching.  From ndiff's
output, either of the input files can be recovered exactly, but that's all.
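
A rough sketch of that round trip, for the curious.  It assumes the
standard-library difflib module (a later home for this code, not the
Tools/scripts/ndiff.py script itself) and uses made-up sample lines; treat
it as an illustration rather than a description of ndiff.py's own
command-line interface.

```python
import difflib

# Hypothetical sample inputs: two revisions of a tiny text file.
old = ["alpha\n", "beta\n", "gamma\n"]
new = ["alpha\n", "BETA\n", "gamma\n", "delta\n"]

# ndiff-style report: every line carries a two-character prefix.
delta = list(difflib.ndiff(old, new))

# restore() strips the prefixes and keeps only the lines that belong
# to one side: 1 selects the first input, 2 selects the second.
recovered_old = list(difflib.restore(delta, 1))
recovered_new = list(difflib.restore(delta, 2))

print(recovered_old == old, recovered_new == new)
```

Nothing here applies a change to a third file, which is the sense in which
this is recovery and not patching.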

I wrote ndiff.py over a number of years, to address a specific need:  when
passing out new revisions of plain text specs, there's a real need to
annotate what changed.  "Minimal edit distance" tools like diff do a bad job
at this, for several reasons:

1. They point out inter-line differences, but not intra-line.  If e.g. a "2"
changes to a "3" in some line of a spec, you want that clearly marked.  It's
not enough to print the two lines and say "OK, there's *some* difference
here, and *you* guess how many and where".

2. Their mechanical zeal for finding minimal diffs leads to
counter-intuitive synchronization.  There are a few examples of that in
ndiff's comments.  ndiff's output is meant to be read by humans, so it
strives to "synch up" the same way a human editor would.

3. Relatedly, minimal edits synch up on "junk" (typically blank lines or
whitespace) if that leads to a shorter overall diff.  But people have no
interest in junk that happens to match, and find it much easier to read a
diff report that synchs up only on "real stuff" -- even if that makes the
diff longer.
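
To make point 1 concrete, here is a small sketch of intra-line marking.  It
assumes the standard-library difflib.ndiff function as a stand-in for the
script, and the spec sentence is invented for the example.

```python
import difflib

# Hypothetical one-line "spec" in which a single digit changes.
old = ["The retry limit is 2.\n"]
new = ["The retry limit is 3.\n"]

report = list(difflib.ndiff(old, new))

# Lines starting with "? " are guide lines: they sit under a "- " or "+ "
# line and flag the columns where the two similar lines differ.
guides = [line for line in report if line.startswith("? ")]

print("".join(report), end="")
```

A plain line-oriented diff would show only the old and new lines; the extra
guide lines are what tell the reader where inside the line to look.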

That said, ndiff.py would be easy to twist into producing diff-like output:
keep the SequenceMatcher class, and throw away most of the rest of the
module.  SequenceMatcher computes a list of {equal, replace, insert, delete}
"opcodes", and it's straightforward to convert that into any kind of output
you want.  Set the IS_LINE_JUNK variable to "lambda x: 0", and it will
produce smaller diffs, more like diff's, but with no claim to being minimal.
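
A minimal sketch of that conversion, assuming the SequenceMatcher class as
exposed by the standard-library difflib module.  The diff_like helper and
the sample lines are hypothetical names made up for this example, and the
output format is just one possibility, not real diff or patch syntax.
Passing None as the junk predicate treats nothing as junk, which plays the
same role as the "lambda x: 0" setting.

```python
from difflib import SequenceMatcher


def diff_like(a, b):
    """Render SequenceMatcher opcodes as lines prefixed '  ', '- ' or '+ '."""
    matcher = SequenceMatcher(None, a, b)  # None: no element counts as junk
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend("  " + line for line in a[i1:i2])
        if tag in ("replace", "delete"):
            # Lines present only in the first sequence.
            out.extend("- " + line for line in a[i1:i2])
        if tag in ("replace", "insert"):
            # Lines present only in the second sequence.
            out.extend("+ " + line for line in b[j1:j2])
    return out


a = ["one\n", "two\n", "three\n"]
b = ["one\n", "2\n", "three\n", "four\n"]

tags = [op[0] for op in SequenceMatcher(None, a, b).get_opcodes()]
lines = diff_like(a, b)
print("".join(lines), end="")
```

Each opcode is a tuple (tag, i1, i2, j1, j2) saying how a[i1:i2] relates to
b[j1:j2], so any other report format is a matter of changing what the loop
emits for each tag.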

otoh-if-you-want-"diff"-output-there's-a-reasonable-case-to-be-
   made-for-using-diff<wink>-ly y'rs  - tim





