Fuzzy Lookups

Gregory Piñero gregpinero at gmail.com
Tue Jan 31 22:24:46 CET 2006


I wonder which algorithm determines the similarity between two strings better?

On 1/31/06, Kent Johnson <kent at kentsjohnson.com> wrote:
> Gregory Piñero wrote:
> > Ok, ok, I got it!  The Pythonic way is to use an existing library ;-)
> >
> > import difflib
> > CloseMatches=difflib.get_close_matches(AFileName,AllFiles,20,.7)
> >
> > I wrote a script to delete duplicate mp3's by filename a few years
> > back with this.  If anyone's interested in seeing it, I'll post a blog
> > entry on it.  I'm betting it uses a similiar algorithm your functions.
>
> A quick trip to difflib.py produces this description of the matching
> algorithm:
>
>    The basic
>    algorithm predates, and is a little fancier than, an algorithm
>    published in the late 1980's by Ratcliff and Obershelp under the
>    hyperbolic name "gestalt pattern matching".  The basic idea is to find
>    the longest contiguous matching subsequence that contains no "junk"
>    elements (R-O doesn't address junk).  The same idea is then applied
>    recursively to the pieces of the sequences to the left and to the
>    right of the matching subsequence.
>
> So no, it doesn't seem to be using Levenshtein distance.
>
> Kent
> --
> http://mail.python.org/mailman/listinfo/python-list
>


--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)



More information about the Python-list mailing list