Fuzzy Lookups

Kent Johnson kent at kentsjohnson.com
Tue Jan 31 15:54:42 EST 2006


Gregory Piñero wrote:
> Ok, ok, I got it!  The Pythonic way is to use an existing library ;-)
> 
> import difflib
> CloseMatches=difflib.get_close_matches(AFileName,AllFiles,20,.7)
> 
> I wrote a script to delete duplicate mp3's by filename a few years
> back with this.  If anyone's interested in seeing it, I'll post a blog
> entry on it.  I'm betting it uses a similiar algorithm your functions.

A quick trip to difflib.py produces this description of the matching 
algorithm:

   The basic
   algorithm predates, and is a little fancier than, an algorithm
   published in the late 1980's by Ratcliff and Obershelp under the
   hyperbolic name "gestalt pattern matching".  The basic idea is to find
   the longest contiguous matching subsequence that contains no "junk"
   elements (R-O doesn't address junk).  The same idea is then applied
   recursively to the pieces of the sequences to the left and to the
   right of the matching subsequence.

So no, it doesn't seem to be using Levenshtein distance.

Kent



More information about the Python-list mailing list