Fuzzy string matching?

John Machin sjmachin at lexicon.net
Sat Aug 28 01:18:47 CEST 1999


On 27 Aug 99, at 9:52, Fredrik Lundh wrote: [[pruned]]
> Hans Nowak <hnowak at cuci.nl> wrote: [[pruned]]
>> I expect that some of their input will not
> > be recognized in the card database. In such cases, I would like to
> > present the user with a list of card names which match the input
> > closely, but not entirely... strings that 'almost match'. OK, so it's
> > just a fuzzy string match. Is such a thing done in Python already? If
> > not, would it be difficult to do? 
> 
> the soundex module (or a similar technique, tuned to
> your dictionary) could be what you need:
> 
> import soundex
> print soundex.sound_similar("fredrik", "frederick")
> 1
> print soundex.sound_similar("fredrik", "hans")
> 0

Folks,

In my opinion, Soundex is generally far too hit-and-miss to be used in 
what is essentially a simple spelling-and-typo-correction application.

The basic problem is that it gives a yes-no answer. The proportion of 
false-positives and false-negatives will usually be found to be too 
high.

"Standard" soundex uses a four-byte key. I once had to repair the 
customer search component of an application --- among several 
sillinesses, it was using soundex(surname) as the sole method of 
grouping. This gave ludicrous results like:
-- search for "Mousaferiadis" and get one useful hit PLUS bucket-loads 
of "McPherson" (false-positives)
-- search for "Moudaferiadis" (single-key typo) and get no hits even 
though "Mousaferiadis" was in the database (false-negative)

Here are some illustrative results from the Python-supplied soundex 
module, which generates a 6-byte key:

>>> soundex.get_soundex("McPherson")
'M21625'
>>> soundex.get_soundex("Mousaferiadis")
'M21632'
>>> soundex.get_soundex("Moudaferiadis")
'M31632'
>>> soundex.get_soundex("Mousaferiadia")
'M21630'

This module is incorrect in its handling of multiple occurrences of the 
same code:
Bug #1 removes instances of the same code when they are separated by 
vowels (or W, H, Y).
Bug #2 has the inverse problem when the input starts with a doubled 
consonant.
>>> soundex.get_soundex("Mississippi")
'M21000' # Should be M221... (bug #1)
>>> soundex.get_soundex("Lloyd")
'L43000' # Should be L300... (bug #2)
>>> soundex.get_soundex("Loyd")
'L30000' # OK
>>> soundex.get_soundex("Lukasiewicz")
'L20000' # Should be L222... (bug #1)
>>> soundex.get_soundex("Lissajous")
'L20000' # Should be L222... (bug #1)

Reference: Knuth, D.E. "The Art of Computer Programming", vol 3, page 
392.

There have been various attempts to produce a better phonetic code e.g. 
NYSIIS code, Metaphon, ... but they still give a simplistic yes/no 
answer.

In Hans' application, a method which searched the whole vocabulary 
(small size in his case) and used a similarity score based on edit 
distance (a.k.a. Levenshtein distance) is what I would recommend --- 
see posting by Magnus Lie Hetland --- and maybe a follow-up or two :-)

HTH,
John





More information about the Python-list mailing list