
[Nick Coghlan]
... Hmm, I've been using difflib.SequenceMatcher for years in a serial bit error rate tester (with typical message sizes ranging from tens of bytes to tens of thousands of bytes) that occasionally gives unexpected results. I'd been blaming hardware glitches (and, to be fair, all of the odd results I can recall off the top of my head were definitively traced to problems in the hardware under test), but I should probably check I'm not running afoul of this bug.
That would be prudent ;-)
And Tim, the algorithm may not be optimal as a general purpose binary diff algorithm, but it's still a hell of a lot faster than the hardware I use it to test. Compared to the equipment configuration times, the data comparison time is trivial.
I'm all in favor of people using the class for any purpose they find useful. Just saying the overwhelmingly most common use is for comparing text files, and that's important - most important, since most widely used.
There's another possibility here - perhaps the heuristic should be off by default in SequenceMatcher, with a TextMatcher subclass that enables it (and Differ and HtmlDiff then inheriting from the latter)?
Or, to make life easiest for the most common uses, create a subclass that _didn't_ have any notion of "junk" whatsoever. Or a new flag to turn auto-junk heuristics off. Unfortunately, no true solution exists without changing _something_ in the API, and since the behavior changed 8 years ago there's just no guessing how many uses rely on the by-now-long-current behavior.
There's currently barely anything in the SequenceMatcher documentation to indicate that it is designed primarily for comparing text rather than arbitrary sequences (the closest it gets is the reference to Ratcliff/Obershelp gestalt pattern matching and then the link to the Ratcliff/Metzener Dr. Dobb's article - and until this thread, I'd never followed the article link).
It's designed to compare sequences of hashable elements. The use cases that _drove_ the implementation were (1) viewing a file as a sequence of lines, and (2) viewing a line as a sequence of characters. Use cases always drive implementation (although that's rarely mentioned in the docs), but striving for natural generalization sometimes pays off. I expected it would in this case, and that others have found unanticipated uses confirms that it did. Unfortunately, I screwed up by not finishing what I started 8 years ago (adding an auto-junk heuristic that was very valuable in the primary use case, but turned out to have very bad effects in some other cases).
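Those two driving use cases can be shown side by side - the same class handles both, because lists of strings and strings themselves are both sequences of hashable elements:

```python
from difflib import SequenceMatcher

# Use case (1): a "file" viewed as a sequence of lines.
old = ["def f():\n", "    return 1\n"]
new = ["def f():\n", "    return 2\n"]
file_level = SequenceMatcher(None, old, new)
print(file_level.get_opcodes())
# [('equal', 0, 1, 0, 1), ('replace', 1, 2, 1, 2)]

# Use case (2): a line viewed as a sequence of characters.
line_level = SequenceMatcher(None, "    return 1\n", "    return 2\n")
print(round(line_level.ratio(), 3))  # nearly identical lines score close to 1.0
```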
Rather than reverting to Tim's undocumented vision, perhaps we should better articulate it by separating the general purpose matcher from an optimised text matcher.
Having a notion of junk can improve the quality of results (in the sense of making them more intuitive to humans, which was an explicit goal of the module), and can yield enormous speedups. Is this restricted solely to comparing text files? I don't know that to be the case, but offhand doubt it's true. As always, if they exist, we won't hear from people with other use cases who never noticed the change (except perhaps to see improved speed and better results) - not until we "turn off" the heuristic on them, and then they'll view _that_ as "a bug".
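The explicit `isjunk` notion mentioned here is independent of the auto-junk heuristic, and is how Differ already tunes intra-line comparison. The example below is the one from the difflib documentation: treating blanks as junk keeps runs of spaces from acting as synchronization points between the two strings, which tends to produce matches that look more natural to a human reader.

```python
from difflib import SequenceMatcher

a = "private Thread currentThread;"
b = "private volatile Thread currentThread;"

# Blanks are junk: they may match, but never anchor a matching block.
s = SequenceMatcher(lambda ch: ch == " ", a, b)
print(round(s.ratio(), 3))
```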