[Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

Eli Bendersky eliben at gmail.com
Wed Jul 7 09:45:40 CEST 2010


[snip]
> Yes, that was the intent.  I was corresponding with a user at the time
> who had odd notions (well, by my standards) of how to format C code,
> which left him with many hundreds of lines containing only an open
> brace, or a close brace, or just a semicolon (etc).  difflib spun its
> wheels frantically trying to sort this out, and the heuristic in
> question cut processing time from hours (in the worst cases) to
> seconds.
>
> Since that (text file comparison) was always the primary case for this
> class, it was worth doing something about.  But it should not have
> gone in the way it did (undocumented & unfinished, as you correctly
> note).
>
>> which are selected from an
>> unbounded set of possibilities. As explained below, this heuristic partly to
>> completely disables SequenceMatcher for realistic-length sequences from a
>> small finite alphabet.
>
> Which wasn't an anticipated use case, so should not be favored.
> Slowing down difflib for what it was intended for is not a good idea -
> practicality beats purity.
>
> Ya, ya, I understand someone playing around with DNA sequences might
> find difflib tempting at first, but fix this and they're still going
> to be unhappy.  There are much better (faster, "more standard")
> algorithms for comparing sequences drawn from tiny alphabets, and
> especially so for DNA matching.

Tim, thanks for your insights. In response to the description above,
however, I would like to explain my use case, which originally got me
interested in this issue with SequenceMatcher.

I was not comparing DNAs, but using SequenceMatcher in my automatic
testbench checker that verified the output of some logic design. I
didn't want exact comparisons, so I was very happy to see
difflib.SequenceMatcher in stdlib, with its useful ratio/quick_ratio
functions. I was comparing the output sequence to an expected sequence
with a 0.995 ratio threshold and was very happy. Until my sequence got
longer than 200 elements...

So this isn't DNA, and the alphabet wasn't too tiny, but on the other
hand there was nothing in the module to suggest that it should be only
used to comparing lines in files. On the contrary, its
general-sounding name - SequenceMatcher, lulled me into the (false?)
belief that I can just use it for my sequence comparison without
worrying about finding better algorithms or implementing stuff like
edit distance myself. Judging by the comments in other related issues,
I'm far from being the only one.

Therefore, I think that you should just admit that your excellent
module became useful for more purposes than you originally intended it
for :-) !! I completely respect your desire to keep the "intended
purposes" as fast as possible, but there are solutions (some of which
were presented by Terry) that can make it more useful without any harm
to the performance of the intended purpose.

As Terry noted, I will be very happy to submit a patch with tests for
whatever decision that will be reached by pydev on this matter.

Eli


More information about the Python-Dev mailing list