Looking for library to estimate likeness of two strings
Robert Kern
robert.kern at gmail.com
Thu Feb 7 00:32:53 CET 2008
Jeff Schwab wrote:
> Tim Chase wrote:
>>> Are there any Python libraries implementing measurement of similarity
>>> of two strings of Latin characters?
>> It sounds like you're interested in calculating the Levenshtein distance:
>>
>> http://en.wikipedia.org/wiki/Levenshtein_distance
>>
>> which gives you a measure of how different they are. A measure of "0"
>> is that the inputs are the same. The more different the two strings
>> are, the greater the resulting output of the function.
>>
>> Unfortunately, it's an O(MN) algorithm (where M=len(word1) and
>> N=len(word2)) from my understanding of the code I've seen. However it
>> really is the best approximation I've seen of a "how similar are these
>> two strings" function. Googling for
>>
>> python levenshtein distance
>>
>> brings up oodles of hits.
>
> If the strings happen to be the same length, the Levenshtein distance is
> equivalent to the Hamming distance. The Wikipedia article gives the
> following Python implementation:
>
> # http://en.wikipedia.org/wiki/Hamming_distance
> def hamdist(s1, s2):
> assert len(s1) == len(s2)
> return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
I'm afraid that it isn't. Using Magnus Lie Hetland's implementation:
http://hetland.org/coding/python/levenshtein.py
In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop.
:def levenshtein(a,b):
: "Calculates the Levenshtein distance between a and b."
: n, m = len(a), len(b)
: if n > m:
: # Make sure n <= m, to use O(min(n,m)) space
: a,b = b,a
: n,m = m,n
:
: current = range(n+1)
: for i in range(1,m+1):
: previous, current = current, [i]+[0]*n
: for j in range(1,n+1):
: add, delete = previous[j]+1, current[j-1]+1
: change = previous[j-1]
: if a[j-1] != b[i-1]:
: change = change + 1
: current[j] = min(add, delete, change)
:
: return current[n]
:--
In [2]:
In [3]: def hamdist(s1, s2):
...: assert len(s1) == len(s2)
...: return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
...:
In [4]: hamdist('abcdef', 'fabcde')
Out[4]: 6
In [5]: levenshtein('abcdef', 'fabcde')
Out[5]: 2
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
More information about the Python-list
mailing list