[snip]
Yes, that was the intent. I was corresponding with a user at the time who had odd notions (well, by my standards) of how to format C code, which left him with many hundreds of lines containing only an open brace, or a close brace, or just a semicolon (etc). difflib spun its wheels frantically trying to sort this out, and the heuristic in question cut processing time from hours (in the worst cases) to seconds.
Since that (text file comparison) was always the primary case for this class, it was worth doing something about. But it should not have gone in the way it did (undocumented & unfinished, as you correctly note).
which are selected from an unbounded set of possibilities. As explained below, this heuristic partly to completely disables SequenceMatcher for realistic-length sequences from a small finite alphabet.
Which wasn't an anticipated use case, so should not be favored. Slowing down difflib for what it was intended for is not a good idea - practicality beats purity.
Ya, ya, I understand someone playing around with DNA sequences might find difflib tempting at first, but fix this and they're still going to be unhappy. There are much better (faster, "more standard") algorithms for comparing sequences drawn from tiny alphabets, and especially so for DNA matching.
Tim, thanks for your insights. In response to the description above, however, I would like to explain my use case, which originally got me interested in this issue with SequenceMatcher. I was not comparing DNAs, but using SequenceMatcher in my automatic testbench checker that verified the output of some logic design. I didn't want exact comparisons, so I was very happy to see difflib.SequenceMatcher in stdlib, with its useful ratio/quick_ratio functions. I was comparing the output sequence to an expected sequence with a 0.995 ratio threshold and was very happy. Until my sequence got longer than 200 elements... So this isn't DNA, and the alphabet wasn't too tiny, but on the other hand there was nothing in the module to suggest that it should be only used to comparing lines in files. On the contrary, its general-sounding name - SequenceMatcher, lulled me into the (false?) belief that I can just use it for my sequence comparison without worrying about finding better algorithms or implementing stuff like edit distance myself. Judging by the comments in other related issues, I'm far from being the only one. Therefore, I think that you should just admit that your excellent module became useful for more purposes than you originally intended it for :-) !! I completely respect your desire to keep the "intended purposes" as fast as possible, but there are solutions (some of which were presented by Terry) that can make it more useful without any harm to the performance of the intended purpose. As Terry noted, I will be very happy to submit a patch with tests for whatever decision that will be reached by pydev on this matter. Eli