difflib

Sun Apr 22 18:50:31 EDT 2001

[Mikkel Rasmussen]
> Has anybody got any references for the algorithm used in difflib.

Nope.  It's an algorithm I dreamed up in the early 80's, while at Cray
Research, writing a file-differencing tool.  I've carried it with me ever
since, reimplementing it in at least a dozen languages along the way.  The
basic idea popped up when staring at some incomprehensible "diff" output, and
wondering "hmmmm -- what if diff restricted itself to *contiguous*
subsequences?  and then what if it didn't synch up on junk either?".  That's
about it.  You can find more discussion in the comments in module ndiff.py
(difflib's SequenceMatcher class started its life in ndiff.py years ago; it
simply got split out into its own module for 2.1).

When Eric Raymond mentioned the Ratcliff-Obershelp algorithm on Python-Dev,
it was news to me, and a Google search quickly turned up a C implementation
of that.  It was obviously the same basic algorithm, except without the "skip
junk" gimmick, and at least the implementations I found were much less
efficient than SequenceMatcher (this is one of those happy instances where my
Python code runs *faster* than previous C, Fortran and Pascal
implementations:  I had the *idea* of chaining together hash lists to speed
the search a long time ago, but never had the patience to *implement* that
before recoding in Python -- it's easy in Python).

> ...
> Are there any explanations available elsewhere?

You can find a little more by doing a google search.