[Python-Dev] Why is soundex marked obsolete?

Tim Peters tim.one@home.com
Sat, 13 Jan 2001 18:59:44 -0500

> OK, now I understand why soundex isn't in the core -- there's no
> canonical version.

Actually, I think Knuth Vol 3 Ed 3 is canonical *now* -- nobody would dare
to oppose him <0.5 wink>.

> Normalized Hamming similarity: it's an inversion of Hamming distance
> -- number of pairwise matches in two strings of the same length,
> divided by the common string length.  Gives a measure in [0.0, 1.0].
> I've looked up "Levenshtein edit distance" and you're right.  I'll add
> it as a fourth entry point as soon as I can find C source to crib.
> (Would you happen to have a pointer?)
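
For reference, the normalized Hamming similarity described in the quote
above can be sketched in a few lines of Python.  The function name and the
treatment of empty strings are assumptions made for illustration; neither
comes from the module under discussion.

```python
def hamming_similarity(a, b):
    """Fraction of positions at which two equal-length strings agree.

    Returns a float in [0.0, 1.0].
    """
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    if not a:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    # Count the pairwise matches, then divide by the common length.
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)
```

For example, "karolin" and "kathrin" agree in 4 of 7 positions, so the
measure comes out at 4/7.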

If you throw almost everything out of Unix diff, that's what you'll be left
with.  Offhand I don't know of unencumbered, industrial-strength C source; a
problem is that writing a program to compute this is a std homework exercise
(it's a common first "dynamic programming" example), so you can find tons of
bad C source.

Caution:  many people want small variations of "edit distance", usually via
assigning different weights to insertions, replacements and deletions.  A
less common but still popular variant is to say that a transposition ("xy"
vs "yx") is less costly than a delete plus an insert.  Etc.  "edit distance"
is really a family of algorithms.
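
To make the "family" point concrete, here is a minimal Python sketch of the
usual dynamic-programming formulation, with the weights exposed as
parameters.  The function name, the parameter names and the default weights
are all assumptions for illustration.  The optional transposition step
follows the restricted ("optimal string alignment") variant, which is only
one of several ways people define it.

```python
def edit_distance(a, b, ins=1, dele=1, sub=1, transpose=None):
    """Weighted edit distance from string a to string b.

    ins, dele, sub are the costs of an insertion, a deletion and a
    replacement.  If transpose is not None, swapping two adjacent
    characters ("xy" vs "yx") costs that much instead.
    """
    prev2 = None                                  # row i-2, for transpositions
    prev = [j * ins for j in range(len(b) + 1)]   # row 0: build b from ""
    for i in range(1, len(a) + 1):
        cur = [i * dele]                          # column 0: delete a[:i]
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub
            best = min(prev[j] + dele,            # delete a[i-1]
                       cur[j - 1] + ins,          # insert b[j-1]
                       prev[j - 1] + cost)        # match or replace
            if (transpose is not None and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                best = min(best, prev2[j - 2] + transpose)
            cur.append(best)
        prev2, prev = prev, cur
    return prev[-1]
```

With the default unit weights this gives the plain Levenshtein distance, so
"xy" vs "yx" costs 2; passing transpose=1 drops that to 1.  Different weight
choices give different answers for the same pair of strings, which is the
reason there's no single canonical "edit distance".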

>> + Each algorithm on its own is likely controversial.

> Not these.  There *are* canonical versions of all these,

See the "edit distance" gloss above.

> and exact equivalents are all heavily used in commercial OCR
> software.

God forbid that core Python may lose the commercial OCR developer market
<wink>.  It's not accepted that for every field F, core Python needs to
supply the algorithms F uses heavily.  Heck, core Python doesn't even ship
with an FFT!  Doesn't bother the folks working in signal processing.

>> + Computing string similarity is something few apps need anyway.

> Tim, this isn't true.  Any time you need to validate user input
> against a controlled vocabulary and give feedback on probable right
> choices,

Which is something few apps need anyway -- in my experience, but more so in
my *primary* role here of trying to channel for you (& Guido) what Guido
will say.  It should be clear that I've got some familiarity with these
schemes, so it should also be clear that Guido is likely to ask me about
them whenever they pop up.  But Guido has hardly ever asked me about them
over the past decade, with the exception of the short-lived Soundex
brouhaha.  From that I guess hardly anyone ever asks *him* about them, and
that's how channeling works:  if this were an area where Guido felt core
Python needed beefier libraries, I'm pretty sure I would have heard about it
by now.

But now Guido can speak for himself.  There's no conceivable argument that
could change what I *predict* he'll say.

> R/O similarity is *very* useful.  I've had it in my personal
> toolkit for a decade and used it heavily for this -- you take your
> unknown input, check it against a dictionary and kick "maybe you meant
> foo?" to the user for every foo with an R/O similarity above 0.6 or so.
> The effects look like black magic.  Users love it.
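
The "maybe you meant foo?" workflow in the quote above can be sketched with
Python's standard difflib module, whose SequenceMatcher is documented as
being based on Ratcliff and Obershelp's "gestalt pattern matching".  This
assumes "R/O" means Ratcliff/Obershelp, and difflib's ratio may not be
identical to the measure in the quoted toolkit.  The suggest() wrapper and
the sample vocabulary are hypothetical.

```python
import difflib

def suggest(word, vocabulary, cutoff=0.6):
    """Return up to three vocabulary entries similar to word, best first.

    cutoff is the minimum similarity ratio in [0.0, 1.0]; 0.6 matches
    the "0.6 or so" threshold mentioned in the quote.
    """
    return difflib.get_close_matches(word, vocabulary, n=3, cutoff=cutoff)

if __name__ == "__main__":
    # Hypothetical controlled vocabulary.
    words = ["ape", "apple", "peach", "puppy"]
    for guess in suggest("appel", words):
        print("maybe you meant %s?" % guess)
```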

I believe that.  And I'd guess we all have things in our personal toolkits
our users love.  That isn't enough to get into the core, as I expect Guido
will belabor on the next iteration of this <wink>.

doesn't-mean-the-code-isn't-mondo-cool-ly y'rs  - tim