[Python-Dev] Why is soundex marked obsolete?
Eric S. Raymond
esr@thyrsus.com
Sat, 13 Jan 2001 17:15:28 -0500
OK, now I understand why soundex isn't in the core -- there's no canonical
version.
Tim Peters <tim.one@home.com>:
> + There are any number of algorithms people may want to see (I don't know
> what "normalized Hamming similarity" means, but if it's not the same as
> Levenshtein edit distance then add the latter to the pot too).
Normalized Hamming similarity: it's an inversion of Hamming distance
-- number of pairwise matches in two strings of the same length,
divided by the common string length. Gives a measure in [0.0, 1.0].
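As a minimal sketch of the definition above (the function name and the handling of empty strings are assumptions for illustration, not taken from any existing module):

```python
def hamming_similarity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("Hamming similarity needs strings of equal length")
    if not a:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    # Count positions where the characters match pairwise.
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)
```

Identical strings score 1.0, and strings that differ at every position score 0.0.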
I've looked up "Levenshtein edit distance" and you're right. I'll add it
as a fourth entry point as soon as I can find C source to crib. (Would
you happen to have a pointer?)
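For reference, here is a sketch of the textbook dynamic-programming formulation in Python rather than C. It assumes unit cost for each insertion, deletion, and substitution; the function name is hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]
```

Unlike the Hamming measure, this accepts strings of different lengths. It uses only two rows of the table at a time, so memory grows with the shorter string.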
> + Each algorithm on its own is likely controversial.
Not these. There *are* canonical versions of all these, and exact
equivalents are all heavily used in commercial OCR software.
> + Computing string similarity is something few apps need anyway.
Tim, this isn't true. Any time you need to validate user input
against a controlled vocabulary and give feedback on probable right
choices, R/O similarity is *very* useful. I've had it in my personal
toolkit for a decade and used it heavily for this -- you take your
unknown input, check it against a dictionary and kick "maybe you meant
foo?" to the user for every foo with an R/O similarity above 0.6 or so.
The effects look like black magic. Users love it.
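A rough sketch of that "maybe you meant" workflow follows. It assumes "R/O" means Ratcliff/Obershelp, and it uses the standard library's difflib as a stand-in, since the difflib documentation describes its matcher as related to that algorithm. This is not the toolkit code mentioned above; the helper name and the vocabulary are made up for illustration.

```python
import difflib

def suggest(word, vocabulary, threshold=0.6):
    """Return vocabulary entries scoring at least `threshold` against `word`.

    Results come back best match first, at most three of them.
    """
    return difflib.get_close_matches(word, vocabulary, n=3, cutoff=threshold)

# Hypothetical controlled vocabulary and a mistyped input.
vocabulary = ["apple", "ape", "peach", "puppy"]
for candidate in suggest("appel", vocabulary):
    print(f"maybe you meant {candidate}?")
```

The 0.6 threshold is the one suggested above. Raising it gives fewer, safer suggestions, and lowering it gives more guesses.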
--
<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
"I hold it, that a little rebellion, now and then, is a good thing, and as
necessary in the political world as storms in the physical."
-- Thomas Jefferson, Letter to James Madison, January 30, 1787