diff lists
Alex Martelli
aleaxit at yahoo.com
Fri Mar 30 14:30:05 CEST 2001
"Steven D. Majewski" <sdm7g at Virginia.EDU> wrote in message
news:mailman.985914639.5108.python-list at python.org...
[snip]
> On 29 Mar 2001, Michael Hudson wrote:
> > Tangentially, does anyone know of any good algorithms for "edit
> > distance" between two sequences? E.g. if I have
> > "abcdef"
> > and want to get to
> > "abQUACKcde"
> > I want to get the answer back "insert 'QUACK' at position 3 and delete
> > a character at position 11".
> > "Good" means "pretty quick", here.
[snip]
> http://www.ling.ohio-state.edu/~cbrew/795M/string-distance.html
>
> However, the algorithm for the distance does not give location, but
> maybe this info can be extracted from the intermediate matrix.
Thanks Steve for the good pointers -- the latter in particular
(by Chris Brew) is expressed in pretty-clear Pythonish pseudocode
(although incomplete and suffering a typo or two) so it was not
hard to transliterate -- here are both versions, distance only
and full-path. I've CC'd Chris (as credit for using his page:-)
[Chris, the incompleteness is on the second half of what I
called edit_path, and the typo is in your line which goes
d(i+1,0) = (sofar + delCost,DEL,d(0,j)) (**)
where it should be
d(i+1,0) = (sofar + delCost,DEL,d(i,o)) (**)
but, mainly, thanks for that page!-)].
def atomDistance(char1, char2, substCost=1.0):
if char1==char2: return 0
return substCost
def edit_dist(string1, string2, delCost=1.0, insCost=1.0,
atomDistance=atomDistance):
m = len(string1)
n = len(string2)
d = [ [0.0]*(n+1) for i in range(m+1) ]
for i in range(m):
d[i+1][0] = d[i][0] + delCost
for j in range(n):
d[0][j+1] = d[0][j] + insCost
for i in range(m):
for j in range(n):
subst = atomDistance(string1[i], string2[j])
d[i+1][j+1] = min(d[i][j]+subst,
d[i+1][j]+insCost,
d[i][j+1]+delCost)
return d[m][n]
def edit_path(string1, string2, delCost=1.0, insCost=1.0,
atomDistance=atomDistance):
m = len(string1)
n = len(string2)
d = [ [(0.0,'Start',None)]*(n+1) for i in range(m+1) ]
for i in range(m):
cost_so_far, move, tb = d[i][0]
d[i+1][0] = cost_so_far+delCost, 'Del(%s)'%string1[i], d[i][0]
for j in range(n):
cost_so_far, move, tb = d[0][j]
d[0][j+1] = cost_so_far+insCost, 'Ins(%s)'%string2[j], d[0][j]
for i in range(m):
for j in range(n):
subst = atomDistance(string1[i], string2[j])
if string1[i]==string2[j]:
submove = 'Ok(%s)'%string1[i]
else:
submove = 'Sub(%s->%s)'%(string1[i],string2[j])
cost_so_far, move, tb = d[i][j]
cost_with_sub = cost_so_far + subst
cost_so_far, move, tb = d[i+1][j]
cost_with_ins = cost_so_far + insCost
insmove = 'Ins(%s)'%string2[j]
cost_so_far, move, tb = d[i][j+1]
cost_with_del = cost_so_far + delCost
delmove = 'Del(%s)'%string1[i]
if cost_with_sub < cost_with_ins:
if cost_with_sub < cost_with_del:
d[i+1][j+1] = cost_with_sub, submove, d[i][j]
else:
d[i+1][j+1] = cost_with_del, delmove, d[i][j+1]
else:
if cost_with_ins < cost_with_del:
d[i+1][j+1] = cost_with_ins, insmove, d[i+1][j]
else:
d[i+1][j+1] = cost_with_del, delmove, d[i][j+1]
return d[m][n]
def unravel(chain):
cost, move, previous = chain
if previous is None:
return [move]
else:
return unravel(previous) + [move]
def showedit(s1, s2):
print "%s -> %s: %d" % (s1, s2, edit_dist(s1,s2))
path = edit_path(s1, s2)
print path
print unravel(path)
if __name__=='__main__':
import sys
if len(sys.argv)==3:
showedit(sys.argv[1],sys.argv[2])
else:
showedit("practical","rational")
Here is sample interactive use for Michel's question:
D:\pybridge>python
Python 2.0 (#8, Oct 16 2000, 17:27:58) [MSC 32 bit (Intel)] on win32
Type "copyright", "credits" or "license" for more information.
>>> import ed
>>> edit = ed.unravel(ed.sdistr('abcdef','abQUACKcde'))
>>> for move in edit:
... print move
...
Start
Ok(a)
Ok(b)
Ins(Q)
Ins(U)
Ins(A)
Ins(C)
Ins(K)
Ok(c)
Ok(d)
Ok(e)
Del(f)
>>>
There are of course lots of little tweaks one could do to speed
this up in practice (the linked-list form of the 'chain' is neat
but probably costly, and it needs to be unraveled anyway -- and
unravel() itself might be non-recursive -- etc, etc) but I do
believe this version is quite readable and didactically valid.
Alex
More information about the Python-list
mailing list