Fuzzy Lookups

Diez B. Roggisch deets at nospam.web.de
Mon Jan 30 17:30:06 CET 2006


Fredrik Lundh wrote:

> Diez B. Roggisch wrote:
> 
>> The advantage becomes apparent when you try to e.g. compare
>>
>> "Angelina Jolie"
>>
>> with
>>
>> "AngelinaJolei"
>>
>> and
>>
>> "Bob"
>>
>> Both have a l-dist of 3
> 
>>>> distance("Angelina Jolie", "AngelinaJolei")
> 3
>>>> distance("Angelina Jolie", "Bob")
> 13
> 
> what did I miss ?

Hmm. I missed something - the "1" before the "3" in 13 when I looked at my
terminal after running the example. And according to

http://www.reference.com/browse/wiki/Levenshtein_distance 

it has the property 

"""It is always at least the difference of the sizes of the two strings."""

And I got my implementation from there (or rather from Magnus Lie Hetland,
whose Python version is referenced there).

So you are right, my example is crap.

But I ran into cases where my normalizing made sense - otherwise I wouldn't
have done it :)

I guess it is more along the lines of (coughed up example)

"abcdef"

compared to

"abcefd"

and

"abcd"

I can only say that I used it to fuzzy-compare people's and hotel names, and
applying the normalization made my results far better.
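
To make the example concrete, here is a minimal sketch. The post does not say
which normalization was used, so dividing by the mean length of the two
strings is purely an assumption for illustration, and the function names
levenshtein and normalized_distance are hypothetical. The distance itself is
the textbook dynamic-programming version, not necessarily the implementation
referenced above.

```python
def levenshtein(a, b):
    """Plain edit distance: insertions, deletions and substitutions cost 1."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance divided by the mean string length (an assumed choice)."""
    mean_length = (len(a) + len(b)) / 2
    if mean_length == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(a, b) / mean_length


if __name__ == "__main__":
    for candidate in ("abcefd", "abcd"):
        print(candidate,
              levenshtein("abcdef", candidate),
              normalized_distance("abcdef", candidate))
```

Both candidates have a raw distance of 2 from "abcdef", so the raw number
cannot rank them. Under the assumed normalization the shorter string "abcd"
comes out further away (2/5) than "abcefd" (2/6); other normalizations would
rank them differently.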

Sorry to cause confusion.

Diez


