A simple newbie question about ndiff

Tim Peters tim.one at comcast.net
Mon Apr 22 12:44:29 EDT 2002


[Neville Franks]
> Actually my entire reason for getting into Python right now was
> only to look at your ndiff code to see the sort of results it
> delivered and also get a feel for its space/time usage.

It reads both files into RAM, and builds a dict mapping each non-junk line
in "the second file" to a list of all indices in the second file at which
the line appears.  So space is proportional to the file sizes.  Time
complexity is extremely data dependent.
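
If you're curious what that index looks like, here's a rough sketch of
the idea -- this isn't the difflib source, and the names here
(build_line_index, line2indices) are made up for illustration:

def build_line_index(b_lines, isjunk=None):
    # Not the real difflib code -- just the shape of the index:  each
    # non-junk line of the second file maps to the list of positions
    # at which it appears.
    line2indices = {}
    for j, line in enumerate(b_lines):
        if isjunk is not None and isjunk(line):
            continue
        line2indices.setdefault(line, []).append(j)
    return line2indices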

> I only discovered your ndiff over the weekend, which is a little
> surprising as I've also been working on Diff's for quite a few
> years now as part of my programmer's editor, ED for Windows.

Cool.  ndiff will do a fine job for you, but if you're slinging giant files
you'll need to compromise.

> I've been working on a major rewrite of ED4W over the past 3+ years
> and I did look at replacing my diff code with something else which
> would hopefully be a bit better (results/time/memory),

ndiff is after results first, memory second, time last.

> but didn't find anything that fitted the bill. I did have a look at
> fcomp which produced good results quickly for small files, but managed
> to use all available memory and time for larger files. Hence the
> interest in ndiff.
> ...
> I wonder how it would perform in C++. From what you've said it
> doesn't sound like there would be much of a performance gain.

You're welcome to try it <wink>.

> FYI on the 73,000 line file I diffed there was only one line different
> in the two files. ED's diff took approx. 16 seconds and ndiff I think
> around 2.5 minutes.

There's one trick you can probably afford that ndiff won't do:  many diff
algorithms start by stripping the longest common prefix and suffix, applying
the workhorse algorithm only to what remains after that.  ndiff does not,
for reasons explained somewhere in a comment.  Instead it gets the effect of
this (specifically, SequenceMatcher does):

    for every element X in sequence 1:
        for every (non-junk) element Y in sequence 2:
            find the longest stretch of equal elements starting at X and Y
            if that's the longest stretch so far, remember it

Now SequenceMatcher uses a much cleverer implementation than that
brute-force triply-nested loop (which, as written, has worst-case cubic
time), but it can still take substantial time, and 70K lines is ridiculously
out of the range of sizes ndiff was intended to handle.  70K is in the range
of a very long novel; the ndiff algorithm was first used to create
human-readable patches for source code files, and later for plain-ASCII
design documents.  Those typically take a few seconds.
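
If you want to poke at the workhorse directly, SequenceMatcher is part
of difflib's public interface.  A tiny sketch (the toy data is made up
here) of the call that does the heavy lifting:

import difflib

a = ["one\n", "two\n", "three\n", "four\n"]
b = ["one\n", "two\n", "3\n", "four\n"]

s = difflib.SequenceMatcher(None, a, b)
# Find the longest matching block across the full ranges of a and b.
i, j, size = s.find_longest_match(0, len(a), 0, len(b))
print("longest match: a[%d:%d] == b[%d:%d]" % (i, i + size, j, j + size))

That one call is what the pseudocode above describes; the cleverness is
all in how it avoids doing the work the cubic way.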

For giant files, stripping common prefix and suffix first shouldn't hurt
quality much, and would speed things enormously.  For example,

# Return (i, m), (i, n) s.t.
#     x[:i] == y[:i]
# and
#     x[m:] == y[n:]
# IOW, find the indices of the common prefix and suffix.
# x[i:m] and y[i:n] are the mismatching parts.

def chop(x, y):
    m, n = len(x), len(y)
    smaller = min(m, n)
    i = 0
    # March forward over the common prefix.
    while i < smaller and x[i] == y[i]:
        i += 1
    # Peel matching elements off the ends:  the common suffix.
    while m > i and n > i and x[m-1] == y[n-1]:
        m -= 1
        n -= 1
    return (i, m), (i, n)

If you fed your 70K-line files thru that first, it should identify the
1-line mismatching regions in a fraction of a second (even leaving the code
in Python).  Etc <wink>.
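
Gluing chop to the rest is about this much work.  Again just a sketch:
the file names are invented, and real code would want to splice the
untouched prefix and suffix back into the delta it reports:

import difflib

def quick_diff(path1, path2):
    with open(path1) as f:
        x = f.readlines()
    with open(path2) as f:
        y = f.readlines()
    (i, m), (j, n) = chop(x, y)   # j == i by construction
    # Only the mismatching middles get fed to the expensive part.
    return list(difflib.ndiff(x[i:m], y[j:n]))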
