Looking for library to estimate likeness of two strings
robert.kern at gmail.com
Thu Feb 7 00:32:53 CET 2008
Jeff Schwab wrote:
> Tim Chase wrote:
>>> Are there any Python libraries implementing measurement of similarity
>>> of two strings of Latin characters?
>> It sounds like you're interested in calculating the Levenshtein distance:
>> which gives you a measure of how different they are. A measure of "0"
>> is that the inputs are the same. The more different the two strings
>> are, the greater the resulting output of the function.
>> Unfortunately, it's an O(MN) algorithm (where M=len(word1) and
>> N=len(word2)) from my understanding of the code I've seen. However it
>> really is the best approximation I've seen of a "how similar are these
>> two strings" function. Googling for
>> python levenshtein distance
>> brings up oodles of hits.
> If the strings happen to be the same length, the Levenshtein distance is
> equivalent to the Hamming distance. The Wikipedia article gives the
> following Python implementation:
> # http://en.wikipedia.org/wiki/Hamming_distance
> def hamdist(s1, s2):
> assert len(s1) == len(s2)
> return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
I'm afraid that it isn't. Using Magnus Lie Hetland's implementation:
In : %cpaste
Pasting code; enter '--' alone on the line to stop.
: "Calculates the Levenshtein distance between a and b."
: n, m = len(a), len(b)
: if n > m:
: # Make sure n <= m, to use O(min(n,m)) space
: a,b = b,a
: n,m = m,n
: current = range(n+1)
: for i in range(1,m+1):
: previous, current = current, [i]+*n
: for j in range(1,n+1):
: add, delete = previous[j]+1, current[j-1]+1
: change = previous[j-1]
: if a[j-1] != b[i-1]:
: change = change + 1
: current[j] = min(add, delete, change)
: return current[n]
In : def hamdist(s1, s2):
...: assert len(s1) == len(s2)
...: return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
In : hamdist('abcdef', 'fabcde')
In : levenshtein('abcdef', 'fabcde')
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
More information about the Python-list