[Terry Reedy]
[Also posted to http://bugs.python.org/issue2986 Developed with input from Eli Bendersky, who will write patchfile(s) for whichever change option is chosen.]
Thanks for paying attention to this, Terry (and Ed)! I somehow managed to miss the whole discussion over the intervening years :-(
Summary: difflib.SequenceMatcher was developed, documented, and originally operated as "a flexible class for comparing pairs of sequences of any [hashable] type".
Although it can be used for that, its true intent was to produce intuitive diffs for revisions of text files (including code files) edited by humans. Where "intuitive" means approximately "less jarring than the often-odd diffs produced by algorithms working on some rigid mathematical notion of 'minimal edit distance'". Whether it's useful for more than that I can't say, because that's all I ever developed (or used) the algorithm for.
An "experimental" heuristic was added in 2.3a1
Big bad on me for that! At the time I fully intended to document that, and at least make it tunable, but life intervened and it got dropped on the floor.
to speed up its application to sequences of code lines,
Yes, that was the intent. I was corresponding with a user at the time who had odd notions (well, by my standards) of how to format C code, which left him with many hundreds of lines containing only an open brace, or a close brace, or just a semicolon (etc). difflib spun its wheels frantically trying to sort this out, and the heuristic in question cut processing time from hours (in the worst cases) to seconds. Since that (text file comparison) was always the primary case for this class, it was worth doing something about. But it should not have gone in the way it did (undocumented & unfinished, as you correctly note).
which are selected from an unbounded set of possibilities. As explained below, this heuristic partly to completely disables SequenceMatcher for realistic-length sequences from a small finite alphabet.
Which wasn't an anticipated use case, so should not be favored. Slowing down difflib for what it was intended for is not a good idea - practicality beats purity. Ya, ya, I understand someone playing around with DNA sequences might find difflib tempting at first, but fix this and they're still going to be unhappy. There are much better (faster, "more standard") algorithms for comparing sequences drawn from tiny alphabets, and especially so for DNA matching.
The regression is easy to fix. The docs were never changed to reflect the effect of the heuristic, but should be, with whatever additional change is made.
True - and always was.
In the commit message for revision 26661, which added the heuristic, Tim Peters wrote "While I like what I've seen of the effects so far, I still consider this experimental. Please give it a try!" Several people who have tried it discovered the problem with small alphabets and posted to the tracker. Issues #1528074, #1678339, #1678345, and #4622 are now-closed duplicates of #2986. The heuristic needs revision.
Open questions (discussed after the examples): what exactly to do, which versions to do it to, and who will do it.
--- Some minimal difference examples:
from difflib import SequenceMatcher as SM
    # base example
    print(SM(None, 'x' + 'y'*199, 'y'*199).ratio())
    # should be and is 0.9975 (rounded)

    # make 'y' junk
    print(SM(lambda c: c == 'y', 'x' + 'y'*199, 'y'*199).ratio())
    # should be and is 0.0

    # increment b by 1 char
    print(SM(None, 'x' + 'y'*199, 'y'*200).ratio())
    # should be 0.995, but now is 0.0 because 'y' is treated as junk

    # reverse a and b, which increments b
    print(SM(None, 'y'*199, 'x' + 'y'*199).ratio())
    # should be 0.9975, as before, but now is 0.0 because 'y' is junked
The reason for the bug is the heuristic: if the second sequence is at least 200 items long then any item occurring more than one percent of the time in the second sequence is treated as junk. This was aimed at recurring code lines like 'else:' and 'return', but can be fatal for small alphabets where common items are necessary content.
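For reference, the heuristic described above can be sketched roughly as follows. The 200-item cutoff and the 1%-popularity test are as stated in the text; the function name is hypothetical and the actual difflib internals may differ in detail:

```python
from collections import Counter

def popular_items(b):
    # Rough sketch of the "autojunk" heuristic added in 2.3a1: for a
    # second sequence of at least 200 items, any item occurring in more
    # than about 1% of its positions is silently treated as junk.
    n = len(b)
    if n < 200:
        return set()
    threshold = n // 100 + 1
    return {item for item, count in Counter(b).items() if count > threshold}

print(popular_items('x' + 'y' * 199))  # {'y'}: 'y' fills 199 of 200 positions
print(popular_items('y' * 150))        # set(): too short to trigger the heuristic
```

This makes the failure mode concrete: for a DNA-style alphabet, every base in a 200+ item sequence exceeds the 1% threshold and gets junked.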
Indeed, it makes no sense at all for tiny alphabets. OTOH, as above, it gave factor-of-thousands speedups for intended use cases, and that's more important to me. There should certainly be a way to turn off the "auto junk" heuristic, and to tune it, but - sorry for being pragmatic ;-) - it was a valuable speed improvement for what I expect still remain difflib's overwhelmingly most common use cases.
A more realistic example than the above is comparing DNA gene sequences.
Comparing DNA sequences is realistic, but using SequenceMatcher to do so is unrealistic except for a beginner just playing with the ideas. There should be a way to disable the heuristic so the beginner can have their fun, but any serious work in this area will need to use different algorithms.
Without the heuristic, SequenceMatcher.get_opcodes() reports an appropriate sequence of matches and edits, and .ratio() works as documented and expected. For 1000/2000/6000 bases, the times on an old Athlon 2800 machine are <1/2/12 seconds. Since 6000 bases is longer than most genes, this is a realistic and practical use.
Did I just call you a beginner just playing with the ideas? Could be ;-) People doing serious work in this area employ a great many other approaches specifically designed for DNA matching: http://en.wikipedia.org/wiki/Sequence_alignment and have developed dozens of software packages to implement them: http://en.wikipedia.org/wiki/Sequence_alignment_software
With the heuristic, everything is junk and there is only one match, the trailing empty match ''=='', augmented by the initial prefix of matching bases. This is followed by one edit: replace the rest of the first sequence with the rest of the second. A much faster way to find the first mismatch would be

    i = 0
    while first[i] == second[i]:
        i += 1

The match ratio, based on the initial matching prefix only, is spuriously low.
The details don't really matter to the conclusion: by definition, "junk" is ignored for purposes of finding a match. Therefore any method of automatically calling something "junk" is going to screw up _some_ use case. I agree that's unacceptable - there must be a way to turn that off. But, again, it proved valuable for SequenceMatcher's _primary_ use case, so keeping it enabled by default is the most sensible thing to do.
--- Questions:
1. What change should be made?
Proposed fix: Disentangle the heuristic from the calculation of the internal b2j dict that maps items to indexes in the second sequence b. Only apply the heuristic (or not) afterward.
Version A: Modify the heuristic to only eliminate common items when there are more than, say, 100 distinct items (when len(b2j) > 100, where b2j is first calculated without popularity deletions).
This would leave DNA, protein, and printable-ASCII+[\n\r\t] sequences alone. On the other hand, realistic sequences of more than 200 code lines should have at least 100 distinct lines, so the heuristic would continue to be applied when it (mostly?) 'should' be. This change leaves the API unchanged and does not require a user decision.
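A minimal sketch of Version A, assuming the alphabet-size cutoff of 100 proposed above (the function name and parameter are hypothetical):

```python
from collections import Counter

def popular_items_a(b, min_alphabet=100):
    # Version A sketch: build the full item -> count map first, and only
    # junk popular items when b draws on a large alphabet.  Tiny
    # alphabets (DNA bases, printable ASCII) are left alone.
    counts = Counter(b)
    n = len(b)
    if n < 200 or len(counts) <= min_alphabet:
        return set()
    threshold = n // 100 + 1
    return {item for item, count in counts.items() if count > threshold}

print(popular_items_a('ACGT' * 300))  # set(): 4-letter alphabet, heuristic skipped

# A code-like sequence: 150 distinct lines plus a recurring 'else:' line
lines = [str(i) for i in range(150)] + ['else:'] * 50
print(popular_items_a(lines))  # {'else:'}: large alphabet, heuristic still applies
```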
It's a better heuristic, in the sense that it sidesteps a fatal problem in the "tiny alphabet" use case. I'm sure it will leave fatal problems for other (unanticipated) use cases, though. So, an improvement, but not "a solution".
Version B: add a parameter to .__init__ to make the heuristic optional. If the default were True ('use it'), then the code would run the same as now (even when bad). With the heuristic turned off, users would be able to get the .ratio they may expect and need. On the other hand, users would have to understand the heuristic to know when and when not to use it.
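Version B might look something like the following sketch; the flag name 'autojunk' is hypothetical, and the function only illustrates how such a flag would gate the heuristic alongside the existing isjunk test:

```python
from collections import Counter

def junk_items(b, isjunk=None, autojunk=True):
    # Version B sketch: the caller-supplied isjunk test still applies,
    # but the popularity heuristic only runs when autojunk is True
    # (the default, preserving current behavior).
    junk = {item for item in set(b) if isjunk is not None and isjunk(item)}
    n = len(b)
    if autojunk and n >= 200:
        threshold = n // 100 + 1
        junk |= {item for item, count in Counter(b).items() if count > threshold}
    return junk

print(junk_items('x' + 'y' * 199))                  # {'y'}: heuristic on by default
print(junk_items('x' + 'y' * 199, autojunk=False))  # set(): caller opted out
```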
Version C: A more radical alternative would be to make one or more of the tuning parameters user settable, with one setting turning it off.
I don't see that they're mutually exclusive. I expect that the best design would have both a better default heuristic _and_ a way to tune or disable it entirely. A & C above.
2. What type of issue is this, and which versions get changed?
I see the proposal as partial reversion of a change that sometimes causes a regression, in order to fix the regression. Such would usually be called a bugfix. Other tracker reviewers claim this issue is a feature request, not a bugfix. Either way, 3.2 gets the fix.
It's definitely a bug in the sense that it broke some previously working code. But pull the patch, and the user I corresponded with at the start will see code comparison times leap back from seconds to hours, which effectively makes difflib useless for him, and so breaks _his_ real application (he used difflib in a commercial text editor, BTW). Both are unacceptable - but that's how things are sometimes. I favor the primary use case (text file comparison).
The practical issue is whether at least 2.7(.1) should get the fix, or whether the bug should forever continue in 2.x.
I'm afraid there's no true fix here without an API change (any heuristic for auto-identifying junk will break _some_ use case). Your #A sounds good to me for 2.7, and a combination of #A and #C sounds best for 3.x.
3. Who will make the change?
Eli will write a patch and I will check it. However, Georg Brandl assigned the issue to Tim Peters, with a request for comment, but Tim never responded.
I see he asked in March of 2009 - he should have picked a month I was paying attention - or at least a year ;-)
Is there an active committer who will grab the issue and do a commit review when a patch is ready?
If Guido will lend me his time machine, I'll go back to 2002 and apply the patch then, when it should have been done :-(