Looking for library to estimate likeness of two strings

Robert Kern robert.kern at gmail.com
Thu Feb 7 00:32:53 CET 2008


Jeff Schwab wrote:
> Tim Chase wrote:
>>> Are there any Python libraries implementing measurement of similarity
>>> of two strings of Latin characters?
>> It sounds like you're interested in calculating the Levenshtein distance:
>>
>> http://en.wikipedia.org/wiki/Levenshtein_distance
>>
>> which gives you a measure of how different they are.  A result of 0 
>> means that the inputs are identical.  The more the two strings differ, 
>> the larger the value the function returns.
>>
>> Unfortunately, it's an O(MN) algorithm (where M=len(word1) and 
>> N=len(word2)), from my understanding of the code I've seen.  However, it 
>> really is the best approximation I've seen of a "how similar are these 
>> two strings" function.  Googling for
>>
>>   python levenshtein distance
>>
>> brings up oodles of hits.
> 
> If the strings happen to be the same length, the Levenshtein distance is 
> equivalent to the Hamming distance.  The Wikipedia article gives the 
> following Python implementation:
> 
> # http://en.wikipedia.org/wiki/Hamming_distance
> def hamdist(s1, s2):
>      assert len(s1) == len(s2)
>      return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

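I'm afraid that it isn't. The Hamming distance only counts position-by-position
substitutions, while the Levenshtein distance also allows insertions and
deletions, so for equal-length strings the Hamming distance is only an upper
bound on the Levenshtein distance. Using Magnus Lie Hetland's implementation: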

   http://hetland.org/coding/python/levenshtein.py

In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop.
:def levenshtein(a,b):
:    "Calculates the Levenshtein distance between a and b."
:    n, m = len(a), len(b)
:    if n > m:
:        # Make sure n <= m, to use O(min(n,m)) space
:        a,b = b,a
:        n,m = m,n
:
:    current = range(n+1)
:    for i in range(1,m+1):
:        previous, current = current, [i]+[0]*n
:        for j in range(1,n+1):
:            add, delete = previous[j]+1, current[j-1]+1
:            change = previous[j-1]
:            if a[j-1] != b[i-1]:
:                change = change + 1
:            current[j] = min(add, delete, change)
:
:    return current[n]
:--

In [2]:

In [3]: def hamdist(s1, s2):
    ...:      assert len(s1) == len(s2)
    ...:      return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
    ...:

In [4]: hamdist('abcdef', 'fabcde')
Out[4]: 6

In [5]: levenshtein('abcdef', 'fabcde')
Out[5]: 2
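
To spell out why the two numbers differ, here is a minimal Python 3 sketch. It
is my own rewrite for illustration, not Hetland's code; the function name
hamming and the sample pairs are assumptions. Substituting every mismatched
position is always a valid edit script for equal-length strings, so the
Levenshtein distance can never exceed the Hamming distance, but a shifted
string can be repaired much more cheaply with one insertion and one deletion.

```python
def hamming(s1, s2):
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


def levenshtein(a, b):
    """Edit distance allowing insertions, deletions and substitutions."""
    if len(a) > len(b):
        a, b = b, a  # keep the stored row as short as possible
    previous = list(range(len(a) + 1))
    for i, cb in enumerate(b, start=1):
        current = [i]
        for j, ca in enumerate(a, start=1):
            current.append(min(
                previous[j] + 1,               # account for an extra character of b
                current[j - 1] + 1,            # account for an extra character of a
                previous[j - 1] + (ca != cb),  # substitute (free when they match)
            ))
        previous = current
    return previous[-1]


# Hypothetical sample pairs: one shifted string, two single substitutions.
pairs = [("abcdef", "fabcde"), ("abcdef", "abcxef"), ("kitten", "sitten")]
for s1, s2 in pairs:
    print(s1, s2, hamming(s1, s2), levenshtein(s1, s2))
```

The two measures agree on the pairs that differ by a single substitution and
disagree on the shifted pair, which is the case from In [4] and In [5].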


-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco



