[Python-Dev] Why is soundex marked obsolete?

Eric S. Raymond esr@thyrsus.com
Sat, 13 Jan 2001 17:15:28 -0500


OK, now I understand why soundex isn't in the core -- there's no canonical 
version.

Tim Peters <tim.one@home.com>:
> + There are any number of algorithms people may want to see (I don't know
> what "normalized Hamming similarity" means, but if it's not the same as
> Levenshtein edit distance then add the latter to the pot too).

Normalized Hamming similarity: it's an inversion of Hamming distance
-- number of pairwise matches in two strings of the same length,
divided by the common string length.  Gives a measure in [0.0, 1.0].
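
A minimal sketch of that definition in Python. The function name
hamming_similarity and the handling of empty strings are assumptions
made for illustration; they are not taken from any existing module.

```python
def hamming_similarity(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    if not a:
        # Assumption: two empty strings are treated as a perfect match.
        return 1.0
    # Count the positions where the characters are identical.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)
```

Identical strings score 1.0, and strings that differ at every position
score 0.0.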

I've looked up "Levenshtein edit distance" and you're right.  I'll add it
as a fourth entry point as soon as I can find C source to crib.  (Would
you happen to have a pointer?)
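
For reference, the textbook dynamic-programming formulation is short
enough to sketch in Python. This is not the C source asked for above,
and the name levenshtein is only illustrative; it assumes unit cost for
each insertion, deletion, and substitution.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]
```

Only two rows of the table are kept at a time, so memory use is
proportional to the length of the second string.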

> + Each algorithm on its own is likely controversial.

Not these.  There *are* canonical versions of all these, and exact
equivalents are all heavily used in commercial OCR software.

> + Computing string similarity is something few apps need anyway.

Tim, this isn't true.  Any time you need to validate user input
against a controlled vocabulary and give feedback on probable right
choices, R/O similarity is *very* useful.  I've had it in my personal
toolkit for a decade and used it heavily for this -- you take your
unknown input, check it against a dictionary and kick "maybe you meant
foo?" to the user for every foo with an R/O similarity above 0.6 or so.
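
The lookup described above can be sketched with difflib from the current
Python standard library, whose SequenceMatcher is documented as similar
to the Ratcliff/Obershelp approach. This is an assumption standing in for
the toolkit code, which is not shown here; the helper name suggest and
the sample vocabulary are hypothetical.

```python
import difflib

def suggest(word, vocabulary, cutoff=0.6, limit=3):
    """Return up to `limit` vocabulary entries similar to `word`.

    Entries are ranked best match first; anything scoring below
    `cutoff` on difflib's similarity ratio is dropped.
    """
    return difflib.get_close_matches(word, vocabulary, n=limit, cutoff=cutoff)

# Hypothetical controlled vocabulary, for illustration only.
vocabulary = ["ape", "apple", "peach", "puppy"]
for candidate in suggest("appel", vocabulary):
    print("maybe you meant %s?" % candidate)
```

An empty result means nothing in the vocabulary cleared the cutoff, so
no suggestion is offered.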

The effects look like black magic.  Users love it.
-- 
		Eric S. Raymond <http://www.tuxedo.org/~esr/>

"I hold it, that a little rebellion, now and then, is a good thing, and as 
necessary in the political world as storms in the physical."
	-- Thomas Jefferson, Letter to James Madison, January 30, 1787