[Terry Reedy]
[Also posted to http://bugs.python.org/issue2986 Developed with input from Eli Bendersky, who will write patchfile(s) for whichever change option is chosen.]
Thanks for paying attention to this, Terry (and Ed)! I somehow managed to miss the whole discussion over the intervening years :-(
Summary: difflib.SequenceMatcher was developed, documented, and originally operated as "a flexible class for comparing pairs of sequences of any [hashable] type".
Although it can be used for that, its true intent was to produce intuitive diffs for revisions of text files (including code files) edited by humans. Where "intuitive" means approximately "less jarring than the often-odd diffs produced by algorithms working on some rigid mathematical notion of 'minimal edit distance'". Whether it's useful for more than that I can't say, because that's all I ever developed (or used) the algorithm for.
An "experimental" heuristic was added in 2.3a1
Big bad on me for that! At the time I fully intended to document that, and at least make it tunable, but life intervened and it got dropped on the floor.
to speed up its application to sequences of code lines,
Yes, that was the intent. I was corresponding with a user at the time who had odd notions (well, by my standards) of how to format C code, which left him with many hundreds of lines containing only an open brace, or a close brace, or just a semicolon (etc). difflib spun its wheels frantically trying to sort this out, and the heuristic in question cut processing time from hours (in the worst cases) to seconds. Since that (text file comparison) was always the primary case for this class, it was worth doing something about. But it should not have gone in the way it did (undocumented & unfinished, as you correctly note).
which are selected from an unbounded set of possibilities. As explained below, this heuristic partly to completely disables SequenceMatcher for realistic-length sequences from a small finite alphabet.
Which wasn't an anticipated use case, so should not be favored. Slowing down difflib for what it was intended for is not a good idea - practicality beats purity. Ya, ya, I understand someone playing around with DNA sequences might find difflib tempting at first, but fix this and they're still going to be unhappy. There are much better (faster, "more standard") algorithms for comparing sequences drawn from tiny alphabets, and especially so for DNA matching.
The regression is easy to fix. The docs were never changed to reflect the effect of the heuristic, but should be, with whatever additional change is made.
True - and always was.
In the commit message for revision 26661, which added the heuristic, Tim Peters wrote "While I like what I've seen of the effects so far, I still consider this experimental. Please give it a try!" Several people who have tried it discovered the problem with small alphabets and posted to the tracker. Issues #1528074, #1678339, #1678345, and #4622 are now-closed duplicates of #2986. The heuristic needs revision.
Open questions (discussed after the examples): what exactly to do, which versions to do it to, and who will do it.
--- Some minimal difference examples:
from difflib import SequenceMatcher as SM
    # base example
    print(SM(None, 'x' + 'y'*199, 'y'*199).ratio())
    # should be and is 0.9975 (rounded)

    # make 'y' junk
    print(SM(lambda c: c == 'y', 'x' + 'y'*199, 'y'*199).ratio())
    # should be and is 0.0

    # increment b by 1 char
    print(SM(None, 'x' + 'y'*199, 'y'*200).ratio())
    # should be 0.995, but now is 0.0 because 'y' is treated as junk

    # reverse a and b, which increments b
    print(SM(None, 'y'*199, 'x' + 'y'*199).ratio())
    # should be 0.9975, as before, but now is 0.0 because 'y' is junked
The reason for the bug is the heuristic: if the second sequence is at least 200 items long then any item occurring more than one percent of the time in the second sequence is treated as junk. This was aimed at recurring code lines like 'else:' and 'return', but can be fatal for small alphabets where common items are necessary content.
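For reference, the heuristic described above can be sketched roughly as follows. The 200-item cutoff and the 1%-popularity test are as stated in the text; the function name is hypothetical and the actual difflib internals may differ in detail:

```python
from collections import Counter

def popular_items(b):
    # Rough sketch of the "autojunk" heuristic added in 2.3a1: for a
    # second sequence of at least 200 items, any item occurring in more
    # than about 1% of its positions is silently treated as junk.
    n = len(b)
    if n < 200:
        return set()
    threshold = n // 100 + 1
    return {item for item, count in Counter(b).items() if count > threshold}

print(popular_items('x' + 'y' * 199))  # {'y'}: 'y' fills 199 of 200 positions
print(popular_items('y' * 150))        # set(): too short to trigger the heuristic
```

This makes the failure mode concrete: for a DNA-style alphabet, every base in a 200+ item sequence exceeds the 1% threshold and gets junked.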
Indeed, it makes no sense at all for tiny alphabets. OTOH, as above, it gave factor-of-thousands speedups for intended use cases, and that's more important to me. There should certainly be a way to turn off the "auto junk" heuristic, and to tune it, but - sorry for being pragmatic ;-) - it was a valuable speed improvement for what I expect still remain difflib's overwhelmingly most common use cases.
A more realistic example than the above is comparing DNA gene sequences.
Comparing DNA sequences is realistic, but using SequenceMatcher to do so is unrealistic except for a beginner just playing with the ideas. There should be a way to disable the heuristic so the beginner can have their fun, but any serious work in this area will need to use different algorithms.
Without the heuristic, SequenceMatcher.get_opcodes() reports an appropriate sequence of matches and edits, and .ratio() works as documented and expected. For 1000/2000/6000 bases, the times on an old Athlon 2800 machine are <1/2/12 seconds. Since 6000 bases is longer than most genes, this is a realistic and practical use.
Did I just call you a beginner just playing with the ideas? Could be ;-) People doing serious work in this area employ a great many other approaches specifically designed for DNA matching: http://en.wikipedia.org/wiki/Sequence_alignment and have developed dozens of software packages to implement them: http://en.wikipedia.org/wiki/Sequence_alignment_software
With the heuristic, everything is junk and there is only one match, the trailing empty match ''=='', augmented by the initial prefix of matching bases. This is followed by one edit: replace the rest of the first sequence with the rest of the second. A much faster way to find the first mismatch would be

    i = 0
    while first[i] == second[i]:
        i += 1

The match ratio, based on the initial matching prefix only, is spuriously low.
The details don't really matter to the conclusion: by definition, "junk" is ignored for purposes of finding a match. Therefore any method of automatically calling something "junk" is going to screw up _some_ use case. I agree that's unacceptable - there must be a way to turn that off. But, again, it proved valuable for SequenceMatcher's _primary_ use case, so keeping it enabled by default is the most sensible thing to do.
--- Questions:
1. What change should be made?
Proposed fix: Disentangle the heuristic from the calculation of the internal b2j dict that maps items to indexes in the second sequence b. Only apply the heuristic (or not) afterward.
Version A: Modify the heuristic to only eliminate common items when there are more than, say, 100 distinct items (when len(b2j) > 100, where b2j is first calculated without popularity deletions).
This would leave DNA, protein, and printable-ASCII+[\n\r\t] sequences alone. On the other hand, realistic sequences of more than 200 code lines should have at least 100 distinct lines, so the heuristic would continue to be applied when it (mostly?) 'should' be. This change leaves the API unchanged and does not require a user decision.
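A minimal sketch of Version A, assuming the alphabet-size cutoff of 100 proposed above (the function name and parameter are hypothetical):

```python
from collections import Counter

def popular_items_a(b, min_alphabet=100):
    # Version A sketch: build the full item -> count map first, and only
    # junk popular items when b draws on a large alphabet.  Tiny
    # alphabets (DNA bases, printable ASCII) are left alone.
    counts = Counter(b)
    n = len(b)
    if n < 200 or len(counts) <= min_alphabet:
        return set()
    threshold = n // 100 + 1
    return {item for item, count in counts.items() if count > threshold}

print(popular_items_a('ACGT' * 300))  # set(): 4-letter alphabet, heuristic skipped

# A code-like sequence: 150 distinct lines plus a recurring 'else:' line
lines = [str(i) for i in range(150)] + ['else:'] * 50
print(popular_items_a(lines))  # {'else:'}: large alphabet, heuristic still applies
```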
It's a better heuristic, in the sense that it sidesteps a fatal problem in the "tiny alphabet" use case. I'm sure it will leave fatal problems for other (unanticipated) use cases, though. So, an improvement, but not "a solution".
Version B: add a parameter to .__init__ to make the heuristic optional. If the default were True ('use it'), then the code would run the same as now (even when bad). With the heuristic turned off, users would be able to get the .ratio they may expect and need. On the other hand, users would have to understand the heuristic to know when and when not to use it.
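Version B might look something like the following sketch; the flag name 'autojunk' is hypothetical, and the function only illustrates how such a flag would gate the heuristic alongside the existing isjunk test:

```python
from collections import Counter

def junk_items(b, isjunk=None, autojunk=True):
    # Version B sketch: the caller-supplied isjunk test still applies,
    # but the popularity heuristic only runs when autojunk is True
    # (the default, preserving current behavior).
    junk = {item for item in set(b) if isjunk is not None and isjunk(item)}
    n = len(b)
    if autojunk and n >= 200:
        threshold = n // 100 + 1
        junk |= {item for item, count in Counter(b).items() if count > threshold}
    return junk

print(junk_items('x' + 'y' * 199))                  # {'y'}: heuristic on by default
print(junk_items('x' + 'y' * 199, autojunk=False))  # set(): caller opted out
```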
Version C: A more radical alternative would be to make one or more of the tuning parameters user settable, with one setting turning it off.
I don't see that they're mutually exclusive. I expect that the best design would have both a better default heuristic _and_ a way to tune or disable it entirely. A & C above.
2. What type of issue is this, and which versions get changed?
I see the proposal as partial reversion of a change that sometimes causes a regression, in order to fix the regression. Such would usually be called a bugfix. Other tracker reviewers claim this issue is a feature request, not a bugfix. Either way, 3.2 gets the fix.
It's definitely a bug in the sense that it broke some previously working code. But pull the patch, and the user I corresponded with at the start will see code comparison times leap back from seconds to hours, which effectively makes difflib useless for him, and so breaks _his_ real application (he used difflib in a commercial text editor, BTW). Both are unacceptable - but that's how things are sometimes. I favor the primary use case (text file comparison).
The practical issue is whether at least 2.7(.1) should get the fix, or whether the bug should forever continue in 2.x.
I'm afraid there's no true fix here without an API change (any heuristic for auto-identifying junk will break _some_ use case). Your #A sounds good to me for 2.7, and a combination of #A and #C sounds best for 3.x.
3. Who will make the change?
Eli will write a patch and I will check it. However, Georg Brandl assigned the issue to Tim Peters, with a request for comment, but Tim never responded.
I see he asked in March of 2009 - he should have picked a month I was paying attention - or at least a year ;-)
Is there an active committer who will grab the issue and do a commit review when a patch is ready?
If Guido will lend me his time machine, I'll go back to 2002 and apply the patch then, when it should have been done :-(