# Looking for library to estimate likeness of two strings

Robert Kern robert.kern at gmail.com
Thu Feb 7 00:32:53 CET 2008

```Jeff Schwab wrote:
> Tim Chase wrote:
>>> Are there any Python libraries implementing measurement of similarity
>>> of two strings of Latin characters?
>> It sounds like you're interested in calculating the Levenshtein distance:
>>
>> http://en.wikipedia.org/wiki/Levenshtein_distance
>>
>> which gives you a measure of how different they are.  A measure of "0"
>> is that the inputs are the same.  The more different the two strings
>> are, the greater the resulting output of the function.
>>
>> Unfortunately, it's an O(MN) algorithm (where M=len(word1) and
>> N=len(word2)) from my understanding of the code I've seen. However it
>> really is the best approximation I've seen of a "how similar are these
>> two strings" function.  Googling for
>>
>>   python levenshtein distance
>>
>> brings up oodles of hits.
>
> If the strings happen to be the same length, the Levenshtein distance is
> equivalent to the Hamming distance.  The Wikipedia article gives the
> following Python implementation:
>
> # http://en.wikipedia.org/wiki/Hamming_distance
> def hamdist(s1, s2):
>      assert len(s1) == len(s2)
>      return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

I'm afraid that it isn't. Using Magnus Lie Hetland's implementation:

http://hetland.org/coding/python/levenshtein.py

In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop.
:def levenshtein(a,b):
:    "Calculates the Levenshtein distance between a and b."
:    n, m = len(a), len(b)
:    if n > m:
:        # Make sure n <= m, to use O(min(n,m)) space
:        a,b = b,a
:        n,m = m,n
:
:    current = range(n+1)
:    for i in range(1,m+1):
:        previous, current = current, [i]+[0]*n
:        for j in range(1,n+1):
:            add, delete = previous[j]+1, current[j-1]+1
:            change = previous[j-1]
:            if a[j-1] != b[i-1]:
:                change = change + 1
:            current[j] = min(add, delete, change)
:
:    return current[n]
:--

In [2]:

In [3]: def hamdist(s1, s2):
...:      assert len(s1) == len(s2)
...:      return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
...:

In [4]: hamdist('abcdef', 'fabcde')
Out[4]: 6

In [5]: levenshtein('abcdef', 'fabcde')
Out[5]: 2

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma