[Python-Dev] Why is soundex marked obsolete?
Eric S. Raymond
esr@thyrsus.com
Sat, 13 Jan 2001 19:58:08 -0500
Tim Peters <tim.one@home.com>:
> If you throw almost everything out of Unix diff, that's what you'll be left
> with. Offhand I don't know of unencumbered, industrial-strength C source; a
> problem is that writing a program to compute this is a std homework exercise
> (it's a common first "dynamic programming" example), so you can find tons of
> bad C source.
I found some formal descriptions of the algorithm and some unencumbered
Oberon source. I'm coding up C now. It's not complicated if you're willing
to hold the cost matrix in memory, which is reasonable for a string comparator
in a way it wouldn't be for a file diff.
> Caution: many people want small variations of "edit distance", usually via
> assigning different weights to insertions, replacements and deletions. A
> less common but still popular variant is to say that a transposition ("xy"
> vs "yx") is less costly than a delete plus an insert. Etc. "edit distance"
> is really a family of algorithms.
Which all but collapse into one if your function takes three weight
arguments for insert/replace/delete, as mine does. It don't
get more general than that -- I can see that by looking at the formal
description.
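
For readers who want to see the shape of this, here is a minimal Python sketch of a weighted edit distance that holds the whole cost matrix in memory. It is not the C code mentioned above; the function name, parameter names, and default weights are all assumptions made for illustration.

```python
def edit_distance(a, b, ins_cost=1, rep_cost=1, del_cost=1):
    """Weighted edit distance from string a to string b.

    Hypothetical helper: the name, signature, and defaults are
    illustrative, not taken from the code discussed in this thread.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # cost[i][j] = cheapest way to turn a[:i] into b[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * del_cost      # delete every character of a[:i]
    for j in range(1, cols):
        cost[0][j] = j * ins_cost      # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            cost[i][j] = min(
                cost[i - 1][j] + del_cost,                      # delete a[i-1]
                cost[i][j - 1] + ins_cost,                      # insert b[j-1]
                cost[i - 1][j - 1] + (0 if same else rep_cost)  # keep or replace
            )
    return cost[-1][-1]
```

With unit weights this is the textbook dynamic-programming distance; raising rep_cost above ins_cost + del_cost makes the function fall back to pure insert/delete edits. The matrix takes len(a) * len(b) cells, which is fine for short strings and the reason the same approach is unattractive for whole files.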
OK, so I'll give you that I don't weight transpositions separately,
but neither does any other variant I found on the web nor the formal
descriptions. A fourth optional weight argument someday, maybe :-).
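
As a sketch of what such a fourth weight could look like, the Python below adds an adjacent-transposition step to the usual cost matrix (the variant often called "optimal string alignment"). This is an assumption about one possible design, not anything that exists in the code under discussion; the function name and the swap_cost parameter are hypothetical.

```python
def edit_distance_swap(a, b, ins_cost=1, rep_cost=1, del_cost=1, swap_cost=1):
    """Edit distance with a separate weight for swapping adjacent characters.

    Hypothetical illustration; name and parameters are invented here.
    """
    rows, cols = len(a) + 1, len(b) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * del_cost
    for j in range(1, cols):
        cost[0][j] = j * ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            best = min(
                cost[i - 1][j] + del_cost,
                cost[i][j - 1] + ins_cost,
                cost[i - 1][j - 1] + (0 if same else rep_cost),
            )
            # "xy" vs "yx": the last two characters of each prefix are swapped
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                best = min(best, cost[i - 2][j - 2] + swap_cost)
            cost[i][j] = best
    return cost[-1][-1]
```

Setting swap_cost at or above the cost of two replacements makes the extra step irrelevant, so the three-weight behavior is recovered.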
> God forbid that core Python may lose the commercial OCR developer market
> <wink>. It's not accepted that for every field F, core Python needs to
> supply the algorithms F uses heavily.
That's not my point -- I don't see OCR as a big Python market either.
My point in observing that OCR uses Ratcliff/Obershelp heavily was
simply to show that it's a well-established algorithm, not
`controversial'.
> Heck, core Python doesn't even ship
> with an FFT! Doesn't bother the folks working in signal processing.
It probably won't surprise you that I considered writing an FFT extension
module at one point :-).
> > Tim, this isn't true. Any time you need to validate user input
> > against a controlled vocabulary and give feedback on probable right
> > choices,
>
> Which is something few apps need anyway
I fundamentally disagree. Few application designers *know* they need
it, but user interfaces would get a hell of a lot better if the
technique were more commonly applied -- and that's why I want it in
the Python library, so doing the right thing in Python will be a
minimum-effort proposition.
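
To make the use case concrete, here is a small sketch of checking input against a controlled vocabulary and offering probable right choices. It uses difflib.get_close_matches from today's Python standard library, whose matcher is in the Ratcliff/Obershelp family; the vocabulary, the suggest() helper, and the cutoff value are assumptions chosen for illustration.

```python
import difflib

# Hypothetical controlled vocabulary for some command-line tool.
VOCABULARY = ["start", "stop", "status", "restart"]

def suggest(word, vocabulary=VOCABULARY, limit=3):
    """Return likely intended words, or [] if word is already valid.

    Hypothetical helper; only difflib.get_close_matches is a real API.
    """
    if word in vocabulary:
        return []
    # cutoff is a similarity ratio in [0, 1]; 0.6 is difflib's default
    return difflib.get_close_matches(word, vocabulary, n=limit, cutoff=0.6)

if __name__ == "__main__":
    guesses = suggest("statsu")
    if guesses:
        print("Unknown command 'statsu'. Did you mean: %s?" % ", ".join(guesses))
```

The point of the sketch is the effort involved: once a similarity function ships with the library, "did you mean ...?" feedback costs a few lines.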
--
Eric S. Raymond  http://www.tuxedo.org/~esr/
What if you were an idiot, and what if you were a member of Congress?
But I repeat myself.
-- Mark Twain