[Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken
Nick Coghlan
ncoghlan at gmail.com
Wed Jul 7 15:28:36 CEST 2010
On Wed, Jul 7, 2010 at 9:18 AM, Terry Reedy <tjreedy at udel.edu> wrote:
> In the commit message for revision 26661, which added the heuristic, Tim
> Peters wrote "While I like what I've seen of the effects so far, I still
> consider this experimental. Please give it a try!" Several people who have
> tried it discovered the problem with small alphabets and posted to the
> tracker. Issues #1528074, #1678339, #1678345, and #4622 are now-closed
> duplicates of #2986. The heuristic needs revision.
Python 2.3 you say...
Hmm, I've been using difflib.SequenceMatcher for years in a serial bit
error rate tester (with typical message sizes ranging from tens of
bytes to tens of thousands of bytes) that occasionally gives
unexpected results. I'd been blaming hardware glitches (and, to be
fair, all of the odd results I can recall off the top of my head were
definitively traced to problems in the hardware under test), but I
should probably check I'm not running afoul of this bug.
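For anyone wanting to run the same check, here is a minimal sketch of the
small-alphabet case. It assumes a Python version whose SequenceMatcher
accepts an autojunk keyword to switch the popularity heuristic on and off;
the bit patterns and variable names are made up for illustration and are
not taken from my tester.

```python
import difflib

# Hypothetical bit streams: only two distinct symbols, and long enough
# (200+ elements) for the popularity heuristic to apply to the second one.
sent = [1] + [0, 1] * 100      # 201 elements
received = [0, 1] * 100        # 200 elements: the same pattern minus the first bit

# Same comparison twice, with the heuristic enabled and disabled.
with_heuristic = difflib.SequenceMatcher(None, sent, received, autojunk=True)
without_heuristic = difflib.SequenceMatcher(None, sent, received, autojunk=False)

ratio_with = with_heuristic.ratio()
ratio_without = without_heuristic.ratio()
print("heuristic on: ", ratio_with)
print("heuristic off:", ratio_without)
```

With the heuristic on, both symbols count as "popular" and the 200-element
common run should go unreported, so the first ratio should come out far
lower than the second.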
And Tim, the algorithm may not be optimal as a general purpose binary
diff algorithm, but it's still a hell of a lot faster than the
hardware I use it to test. Compared to the equipment configuration
times, the data comparison time is trivial.
There's another possibility here - perhaps the heuristic should be off
by default in SequenceMatcher, with a TextMatcher subclass that
enables it (and Differ and HtmlDiff then inheriting from the latter)?
There's currently barely anything in the SequenceMatcher documentation
to indicate that it is designed primarily for comparing text rather
than arbitrary sequences (the closest it gets is the reference to
Ratcliff/Obershelp gestalt pattern matching and then the link to the
Ratcliff/Metzener Dr. Dobb's article - and until this thread, I'd never
followed the article link). Rather than reverting to Tim's
undocumented vision, perhaps we should better articulate it by
separating the general purpose matcher from an optimised text matcher.
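A rough sketch of what that split could look like, purely as an
illustration. It again assumes a SequenceMatcher that accepts an autojunk
keyword. PlainMatcher is a hypothetical stand-in for "SequenceMatcher with
the heuristic off by default" (the stdlib class can't be changed from user
code), and TextMatcher is the proposed subclass, not an existing difflib
class.

```python
import difflib


class PlainMatcher(difflib.SequenceMatcher):
    """Hypothetical general purpose matcher: popularity heuristic off."""

    def __init__(self, isjunk=None, a="", b=""):
        super().__init__(isjunk, a, b, autojunk=False)


class TextMatcher(PlainMatcher):
    """Hypothetical text-oriented matcher: popularity heuristic on."""

    def __init__(self, isjunk=None, a="", b=""):
        # Bypass PlainMatcher.__init__ so the heuristic can be re-enabled.
        difflib.SequenceMatcher.__init__(self, isjunk, a, b, autojunk=True)


# Illustrative data only: a two-symbol sequence of 200+ elements.
a = [1] + [0, 1] * 100
b = [0, 1] * 100
plain_ratio = PlainMatcher(None, a, b).ratio()
text_ratio = TextMatcher(None, a, b).ratio()
print(plain_ratio, text_ratio)
```

In this arrangement Differ and HtmlDiff would build on the text matcher,
while code comparing arbitrary sequences would get the plain behaviour
without having to know the heuristic exists.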
Cheers,
Nick.
--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia