Fuzzy Lookups
Gregory Piñero
gregpinero at gmail.com
Mon Jan 30 23:10:13 EST 2006
Ok, ok, I got it! The Pythonic way is to use an existing library ;-)
import difflib
CloseMatches=difflib.get_close_matches(AFileName,AllFiles,20,.7)
I wrote a script to delete duplicate mp3's by filename a few years
back with this. If anyone's interested in seeing it, I'll post a blog
entry on it. I'm betting it uses a similiar algorithm your functions.
-Greg
On 30 Jan 2006 17:28:47 -0800, ajones <ajones1 at gmail.com> wrote:
>
> BBands wrote:
> > I have some CDs and have been archiving them on a PC. I wrote a Python
> > script that spans the archive and returns a list of its contents:
> > [[genre, artist, album, song]...]. I wanted to add a search function to
> > locate all the versions of a particular song. This is harder than you
> > might think. For example the Cajun "national anthem" is Jolie Blond,
> > but it can be spelled several different ways jolie, joli, blon, blond,
> > etc... In addition the various online services that provide song info
> > are riddled with typos, so an ordinary string match just doesn't get
> > you there. What is needed is a fuzzy string match and it turns out that
> > there is a very good one, the Levenshtein distance, which is the number
> > of inserts, deletions and substitutions needed to morph one string into
> > another. In my application I match the desired song title against all
> > song titles in my list and return the ones with the lowest Levenshtein
> > distances. This is remarkably, one might even say stunningly,
> > effective, easily finding all the version of Jolie Blon in the list.
> >
> > I am using the following snippet (found on the web, proper attribution
> > unsure), which calculates the Levenshtein distance.
> >
> > def distance(a,b):
> > c = {}
> > n = len(a); m = len(b)
> >
> > for i in range(0,n+1):
> > c[i,0] = i
> > for j in range(0,m+1):
> > c[0,j] = j
> >
> > for i in range(1,n+1):
> > for j in range(1,m+1):
> > x = c[i-1,j]+1
> > y = c[i,j-1]+1
> > if a[i-1] == b[j-1]:
> > z = c[i-1,j-1]
> > else:
> > z = c[i-1,j-1]+1
> > c[i,j] = min(x,y,z)
> > return c[n,m]
> >
> > As mentioned above this works quite well and I am happy with it, but I
> > wonder if there is a more Pythonic way of doing this type of lookup?
> >
> > jab
>
> Here is my stab at it, didn't fully test it so it may not work
> correctly. The tuples could be avoided by using len(x), but the extra
> lookups may cost in execution speed[1].
>
> def distance(a, b):
> """
> Computes the levenshtein distance between two strings
> """
> m, n = (len(a),a), (len(b),b)
> if(m[0] < n[0]): #ensure that the 'm' tuple holds
> the longest string
> m, n = n, m
> dist = m[0] #assume distance = length of
> longest string (worst case)
> for i in range(0, n[0]): # reduce the distance for each char
> match in shorter string
> if m[1][i] == n[1][i]:
> dist = dist - 1
> return dist
>
> [1] I have no if this is true or not. I can see arguments for both.
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>
--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)
More information about the Python-list
mailing list