Fuzzy Lookups
BBands
bbands at gmail.com
Mon Jan 30 16:41:06 CET 2006
I have some CDs and have been archiving them on a PC. I wrote a Python
script that spans the archive and returns a list of its contents:
[[genre, artist, album, song]...]. I wanted to add a search function to
locate all the versions of a particular song. This is harder than you
might think. For example the Cajun "national anthem" is Jolie Blond,
but it can be spelled several different ways jolie, joli, blon, blond,
etc... In addition the various online services that provide song info
are riddled with typos, so an ordinary string match just doesn't get
you there. What is needed is a fuzzy string match and it turns out that
there is a very good one, the Levenshtein distance, which is the number
of inserts, deletions and substitutions needed to morph one string into
another. In my application I match the desired song title against all
song titles in my list and return the ones with the lowest Levenshtein
distances. This is remarkably, one might even say stunningly,
effective, easily finding all the version of Jolie Blon in the list.
I am using the following snippet (found on the web, proper attribution
unsure), which calculates the Levenshtein distance.
def distance(a,b):
c = {}
n = len(a); m = len(b)
for i in range(0,n+1):
c[i,0] = i
for j in range(0,m+1):
c[0,j] = j
for i in range(1,n+1):
for j in range(1,m+1):
x = c[i-1,j]+1
y = c[i,j-1]+1
if a[i-1] == b[j-1]:
z = c[i-1,j-1]
else:
z = c[i-1,j-1]+1
c[i,j] = min(x,y,z)
return c[n,m]
As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?
jab
More information about the Python-list
mailing list