[Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken
Terry Reedy
tjreedy at udel.edu
Wed Jul 7 01:18:09 CEST 2010
[Also posted to http://bugs.python.org/issue2986
Developed with input from Eli Bendersky, who will write patchfile(s) for
whichever change option is chosen.]
Summary: difflib.SequenceMatcher was developed, documented, and
originally operated as "a flexible class for comparing pairs of
sequences of any [hashable] type". An "experimental" heuristic was added
in 2.3a1 to speed up its application to sequences of code lines, which
are selected from an unbounded set of possibilities. As explained below,
this heuristic partly to completely disables SequenceMatcher for
realistic-length sequences from a small finite alphabet. The regression
is easy to fix. The docs were never changed to reflect the effect of the
heuristic, but should be, with whatever additional change is made.
In the commit message for revision 26661, which added the heuristic, Tim
Peters wrote "While I like what I've seen of the effects so far, I still
consider this experimental. Please give it a try!" Several people who
have tried it discovered the problem with small alphabets and posted to
the tracker. Issues #1528074, #1678339, #1678345, and #4622 are
now-closed duplicates of #2986. The heuristic needs revision.
Open questions (discussed after the examples): what exactly to do, which
versions to apply it to, and who will do it.
---
Some minimal difference examples:
from difflib import SequenceMatcher as SM
# base example
print(SM(None, 'x' + 'y'*199, 'y'*199).ratio())
# should be and is 0.9975 (rounded)
# make 'y' junk
print(SM(lambda c:c=='y', 'x' + 'y'*199, 'y'*199).ratio())
# should be and is 0.0
# Increment b by 1 char
print(SM(None, 'x' + 'y'*199, 'y'*200).ratio())
# should be .995, but now is 0.0 because y is treated as junk
# Reverse a and b, which increments b
print(SM(None, 'y'*199, 'x' + 'y'*199).ratio())
# should be .9975, as before, but now is 0.0 because y is junked
The reason for the bug is the heuristic: if the second sequence is at
least 200 items long then any item occurring more than one percent of
the time in the second sequence is treated as junk. This was aimed at
recurring code lines like 'else:' and 'return', but can be fatal for
small alphabets where common items are necessary content.
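The rule just described can be sketched as follows (an assumed paraphrase of the described behavior, not the actual difflib code); note how it junks nothing for the 199-item sequence but junks 'y' for the 200-item one, matching the examples above:

```python
# Sketch of the popularity heuristic as described above (an assumed
# paraphrase, not the actual difflib internals): for a second sequence
# b of length >= 200, any item filling more than about one percent of
# b's positions is treated as junk.
def popular_junk(b):
    if len(b) < 200:
        return set()          # heuristic never triggers for short b
    counts = {}
    for item in b:
        counts[item] = counts.get(item, 0) + 1
    threshold = len(b) // 100 + 1   # "more than one percent"
    return {item for item, n in counts.items() if n > threshold}

print(popular_junk('y' * 199))        # set(): b too short to trigger
print(popular_junk('x' + 'y' * 199))  # {'y'}: necessary content junked
```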
A more realistic example than the above is comparing DNA gene sequences.
Without the heuristic SequenceMatcher.get_opcodes() reports an
appropriate sequence of matches and edits and .ratio works as documented
and expected. For 1000/2000/6000 bases, the times on an old Athlon 2800
machine are <1/2/12 seconds. Since 6000 is longer than most genes, this
is a realistic and practical use.
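As a small illustration (toy sequences kept under the 200-base threshold, so the heuristic cannot interfere even in current Python), get_opcodes() reports the expected match/edit structure for a single substituted base:

```python
from difflib import SequenceMatcher

# Toy DNA-like comparison, well under the 200-item threshold, so the
# popularity heuristic never triggers and matching works as documented.
a = 'ACGT' * 20                           # 80 bases
b = 'ACGT' * 10 + 'AGGT' + 'ACGT' * 9     # same, with one base substituted
sm = SequenceMatcher(None, a, b)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    print(tag, i1, i2, j1, j2)
# equal 0 41 0 41
# replace 41 42 41 42
# equal 42 80 42 80
print(sm.ratio())  # 0.9875 (2 * 79 matches / 160 total items)
```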
With the heuristic, everything is junk and there is only one match,
''=='' augmented by the initial prefix of matching bases. This is
followed by one edit: replace the rest of the first sequence with the
rest of the second sequence. A much faster way to find the first
mismatch would be
i = 0
n = min(len(first), len(second))
while i < n and first[i] == second[i]:
    i += 1
The match ratio, based on the initial matching prefix only, is
spuriously low.
---
Questions:
1. What change should be made.
Proposed fix: Disentangle the heuristic from the calculation of the
internal b2j dict that maps items to indexes in the second sequence b.
Only apply the heuristic (or not) afterward.
Version A: Modify the heuristic to eliminate common items only when
there are more than, say, 100 distinct items (when len(b2j) > 100,
where b2j is first calculated without popularity deletions).
This would leave DNA, protein, and printable ascii+[\n\r\t] sequences
alone. On the other hand, realistic sequences of more than 200 code
lines should have at least 100 different lines, and so the heuristic
should continue to be applied when it (mostly?) 'should' be. This change
leaves the API unchanged and does not require a user decision.
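A rough sketch of Version A (the function name and cutoff value here are illustrative only, not the actual patch): build b2j with no deletions, then apply the popularity rule only when the alphabet is large enough.

```python
# Sketch of Version A (names and cutoff are illustrative, not the
# actual patch): first build the full b2j index, then junk popular
# items only for large alphabets.
def build_b2j(b, alphabet_cutoff=100):
    b2j = {}
    for j, item in enumerate(b):
        b2j.setdefault(item, []).append(j)
    popular = set()
    if len(b) >= 200 and len(b2j) > alphabet_cutoff:
        threshold = len(b) // 100 + 1
        popular = {item for item, idxs in b2j.items()
                   if len(idxs) > threshold}
        for item in popular:
            del b2j[item]
    return b2j, popular

# Two-symbol alphabet: nothing is junked, unlike the current heuristic.
b2j, popular = build_b2j('x' + 'y' * 199)
print(sorted(b2j), popular)  # ['x', 'y'] set()
```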
Version B: add a parameter to .__init__ to make the heuristic optional.
If the default were True ('use it'), then the code would run the same as
now (even when bad). With the heuristic turned off, users would be able
to get the .ratio they may expect and need. On the other hand, users
would have to understand the heuristic to know when and when not to use it.
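For concreteness, a Version B call might look like this (the keyword name 'autojunk' is an assumption for this sketch, not settled API):

```python
from difflib import SequenceMatcher

# Hypothetical Version B usage: a constructor flag disabling the
# heuristic. (The parameter name 'autojunk' is an assumption here.)
print(SequenceMatcher(None, 'x' + 'y' * 199, 'y' * 200,
                      autojunk=False).ratio())  # 0.995, as documented
```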
Version C: A more radical alternative would be to make one or more of
the tuning parameters user settable, with one setting turning it off.
2. What type of issue is this, and which versions get changed.
I see the proposal as partial reversion of a change that sometimes
causes a regression, in order to fix the regression. Such would usually
be called a bugfix. Other tracker reviewers claim this issue is a
feature request, not a bugfix. Either way, 3.2 gets the fix. The
practical issue is whether at least 2.7(.1) should get the fix, or
whether the bug should forever continue in 2.x.
3. Who will make the change.
Eli will write a patch and I will check it. However, Georg Brandl
assigned the issue to Tim Peters, with a request for comment, but Tim
never responded. Is there an active committer who will grab the issue
and do a commit review when a patch is ready?
--
Terry Jan Reedy