A bug in difflib module? (find_longest_match)

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Thu May 1 12:52:02 CEST 2008


En Thu, 01 May 2008 06:21:22 -0300, n00m <n00m at narod.ru> escribió:

>> > import difflib
>> > s = difflib.SequenceMatcher(None, s1, s2)
>> > print s.find_longest_match(0, len(s1), 0, len(s2))
>> > (0, 0, 0)
>> >
>> > I think it's line #314 in difflib "who's to blame" --
>>
>> Me too. Could you think of some alternative? Simply disabling that
>> "popularity check" would slow down the algorithm, according to the
>> comments.
>
> No idea :)

The "ignore popular elements" is only an optmization, and it should not be  
applied in your case because it forces the algorithm to yield an invalid  
result.
I can think of two alternatives:
- tune up the conditions when the optimization is used, or
- make it user configurable

SequenceMatcher is a public class, and it is also internally used by  
Differ and others to compare both sequences of lines *and* pairs of  
similar lines (considered as sequences of characters). In this last usage  
the "ignore popular elements" has no much sense, as shown in your example  
feeding directly two dissimilar strings.
In principle one should disable the "populardict" stuff when dealing with  
strings. Below is a simple attempt to detect that case:

(around line 311 in difflib.py)

         b_is_string = isinstance(b, basestring) # add this line
         for i, elt in enumerate(b):
             if elt in b2j:
                 indices = b2j[elt]
                 if not b_is_string and n >= 200 and len(indices) * 100 >  
n: # change this line
                     populardict[elt] = 1
                     del indices[:]
                 else:
                     indices.append(i)
             else:
                 b2j[elt] = [i]


-- 
Gabriel Genellina



More information about the Python-list mailing list