
[Nick Coghlan]
... Hmm, I've been using difflib.SequenceMatcher for years in a serial bit error rate tester (with typical message sizes ranging from tens of bytes to tens of thousands of bytes) that occasionally gives unexpected results. I'd been blaming hardware glitches (and, to be fair, all of the odd results I can recall off the top of my head were definitively traced to problems in the hardware under test), but I should probably check I'm not running afoul of this bug.
That would be prudent ;-)
And Tim, the algorithm may not be optimal as a general purpose binary diff algorithm, but it's still a hell of a lot faster than the hardware I use it to test. Compared to the equipment configuration times, the data comparison time is trivial.
I'm all in favor of people using the class for any purpose they find useful. Just saying the overwhelmingly most common use is for comparing text files, and that's important - most important, since most widely used.
There's another possibility here - perhaps the heuristic should be off by default in SequenceMatcher, with a TextMatcher subclass that enables it (and Differ and HtmlDiff then inheriting from the latter)?
Or, to make life easiest for the most common uses, create a subclass that _didn't_ have any notion of "junk" whatsoever. Or a new flag to turn auto-junk heuristics off. Unfortunately, no true solution exists without changing _something_ in the API, and since the behavior changed 8 years ago there's just no guessing how many uses rely on the by-now-long-current behavior.
There's currently barely anything in the SequenceMatcher documentation to indicate that it is designed primarily for comparing text rather than arbitrary sequences (the closest it gets is the reference to Ratcliff/Obershelp gestalt pattern matching and then the link to the Ratcliff/Metzener Dr. Dobb's article - and until this thread, I'd never followed the article link).
It's designed to compare sequences of hashable elements. The use cases that _drove_ the implementation were (1) viewing a file as a sequence of lines, and (2) viewing a line as a sequence of characters. Use cases always drive implementation (although that's rarely mentioned in the docs), but striving for natural generalization sometimes pays off. I expected it would in this case, and that others have found unanticipated uses confirms that it did. Unfortunately, I screwed up by not finishing what I started 8 years ago (adding an auto-junk heuristic that was very valuable in the primary use case, but turned out to have very bad effects in some other cases).
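Those two driving use cases can be shown side by side - the same class handles both, because lists of strings and strings themselves are both sequences of hashable elements:

```python
from difflib import SequenceMatcher

# Use case (1): a "file" viewed as a sequence of lines.
old = ["def f():\n", "    return 1\n"]
new = ["def f():\n", "    return 2\n"]
file_level = SequenceMatcher(None, old, new)
print(file_level.get_opcodes())
# [('equal', 0, 1, 0, 1), ('replace', 1, 2, 1, 2)]

# Use case (2): a line viewed as a sequence of characters.
line_level = SequenceMatcher(None, "    return 1\n", "    return 2\n")
print(round(line_level.ratio(), 3))  # nearly identical lines score close to 1.0
```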
Rather than reverting to Tim's undocumented vision, perhaps we should better articulate it by separating the general purpose matcher from an optimised text matcher.
Having a notion of junk can improve the quality of results (in the sense of making them more intuitive to humans, which was an explicit goal of the module), and can yield enormous speedups. Is this restricted solely to comparing text files? I don't know that to be the case, but offhand doubt it's true. As always, if they exist, we won't hear from people with other use cases who never noticed the change (except perhaps to see improved speed and better results) - not until we "turn off" the heuristic on them, and then they'll view _that_ as "a bug".
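The explicit `isjunk` notion mentioned here is independent of the auto-junk heuristic, and is how Differ already tunes intra-line comparison. The example below is the one from the difflib documentation: treating blanks as junk keeps runs of spaces from acting as synchronization points between the two strings, which tends to produce matches that look more natural to a human reader.

```python
from difflib import SequenceMatcher

a = "private Thread currentThread;"
b = "private volatile Thread currentThread;"

# Blanks are junk: they may match, but never anchor a matching block.
s = SequenceMatcher(lambda ch: ch == " ", a, b)
print(round(s.ratio(), 3))
```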