Fuzzy string matching
Magnus L. Hetland
mlh at idt.ntnu.no
Fri Aug 27 15:51:03 CEST 1999
Explanation of the distance algorithm...
(Got a request for this, and thought I might as well post it...)
The algorithm:
def distance(a,b):
c = {}
n = len(a); m = len(b)
for i in range(0,n+1):
c[i,0] = i
for j in range(0,m+1):
c[0,j] = j
for i in range(1,n+1):
for j in range(1,m+1):
x = c[i-1,j]+1
y = c[i,j-1]+1
if a[i-1] == b[j-1]:
z = c[i-1,j-1]
else:
z = c[i-1,j-1]+1
c[i,j] = min(x,y,z)
return c[n,m]
It calculates the following: Given two strings, a and b, and three
operations, adding, subtracting and exchanging single characters, what
is the minimal number of steps needed to translate a into b?
The method is based on the following idea:
We want to find the distance between a[:x] and b[:y]. To do this, we
first calculate
1) the distance between a[:x-1] and b[:y], adding the cost of a
subtract-operation, used to get from a[:x] to a[:z-1];
2) the distance between a[:x] and b[:y-1], adding the cost of an
addition-operation, used to get from b[:y-1] to b[:y];
3) the distance between a[:x-1] and b[:y-1], adding the cost of a
*possible* exchange of the letter b[y] (with a[x]).
The cost of the subtraction and addition operations are 1, while the
exchange operation has a cost of 1 if a[x] and b[y] are different, and
0 otherwise.
After calculating these costs, we choose the least one of them (since
we want to use the best solution.)
Instead of doing this recursively (i.e. calculating ourselves "back"
from the final value), we build a cost-matrix c containing the optimal
costs, so we can reuse them when calculating the later values. The
costs c[i,0] (from string of length n to empty string) are all i, and
correspondingly all c[0,j] (from empty string to string of length j)
are j.
Finally, the cost of translating between the full strings a and b
(c[n,m]) is returned.
I guess that ought to cover it...
--
Magnus Making no sound / Yet smouldering with passion
Lie The firefly is still sadder / Than the moaning insect
Hetland : Minamoto Shigeyuki
More information about the Python-list
mailing list