[Python-Dev] Why is soundex marked obsolete?

Eric S. Raymond esr@thyrsus.com
Sat, 13 Jan 2001 19:58:08 -0500


Tim Peters <tim.one@home.com>:
> If you throw almost everything out of Unix diff, that's what you'll be left
> with.  Offhand I don't know of unencumbered, industrial-strength C source; a
> problem is that writing a program to compute this is a std homework exercise
> (it's a common first "dynamic programming" example), so you can find tons of
> bad C source.

I found some formal descriptions of the algorithm and some unencumbered 
Oberon source.  I'm coding up C now.  It's not complicated if you're willing 
to hold the cost matrix in memory, which is reasonable for a string comparator
in a way it wouldn't be for a file diff.
 
> Caution:  many people want small variations of "edit distance", usually via
> assigning different weights to insertions, replacements and deletions.  A
> less common but still popular variant is to say that a transposition ("xy"
> vs "yx") is less costly than a delete plus an insert.  Etc.  "edit distance"
> is really a family of algorithms.

Which just about collapse into one if your function has three weight
arguments for insert/replace/delete weights, as mine does.  It don't
get more general than that -- I can see that by looking at the formal
description.  
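
For illustration only, here is a minimal Python sketch of the textbook
dynamic-programming formulation with three weights, holding the whole cost
matrix in memory.  It is not the C code mentioned above; the function name,
parameter names, and default weights of 1 are assumptions made for this
example.

```python
def edit_distance(a, b, insert_cost=1, replace_cost=1, delete_cost=1):
    """Weighted edit distance for turning string a into string b.

    Hypothetical signature: the three weights are illustrative
    parameters, not an existing library API.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # cost[i][j] is the cheapest way to turn a[:i] into b[:j].
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * delete_cost      # delete every character of a[:i]
    for j in range(1, cols):
        cost[0][j] = j * insert_cost      # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            cost[i][j] = min(
                cost[i - 1][j] + delete_cost,
                cost[i][j - 1] + insert_cost,
                cost[i - 1][j - 1] + (0 if same else replace_cost),
            )
    return cost[-1][-1]
```

With unit weights this gives the plain Levenshtein distance.  Raising
replace_cost above insert_cost + delete_cost makes a delete plus an insert
the cheaper way to change one character.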

OK, so I'll give you that I don't weight transpositions separately,
but neither does any other variant I found on the web nor the formal
descriptions.  A fourth optional weight argument someday, maybe :-).
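
As a rough idea of what a fourth weight could look like, the sketch below
adds an adjacent-transposition step to the usual recurrence (the variant
usually called "optimal string alignment").  This is an assumption about one
way to do it, not something from my code; the name and the transpose_cost
parameter are hypothetical.

```python
def edit_distance_transpose(a, b, insert_cost=1, replace_cost=1,
                            delete_cost=1, transpose_cost=1):
    """Weighted edit distance that also allows swapping adjacent characters.

    Hypothetical signature for illustration.  Implements the optimal
    string alignment variant: each substring is edited at most once.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # cost[i][j] is the cheapest way to turn a[:i] into b[:j].
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * delete_cost
    for j in range(1, cols):
        cost[0][j] = j * insert_cost
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            best = min(
                cost[i - 1][j] + delete_cost,
                cost[i][j - 1] + insert_cost,
                cost[i - 1][j - 1] + (0 if same else replace_cost),
            )
            # "xy" vs "yx": the last two characters are swapped.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                best = min(best, cost[i - 2][j - 2] + transpose_cost)
            cost[i][j] = best
    return cost[-1][-1]
```

With transpose_cost=1, "xy" vs "yx" costs 1 rather than 2.  Setting
transpose_cost at or above the cost of two replacements makes the extra
step irrelevant.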

> God forbid that core Python may lose the commercial OCR developer market
> <wink>.  It's not accepted that for every field F, core Python needs to
> supply the algorithms F uses heavily.

That's not my point -- I don't see OCR as a big Python market either.
My point in observing that OCR uses Ratcliff/Obershelp heavily was
simply to show that it's a well-established algorithm, not
`controversial'.

>                      Heck, core Python doesn't even ship
> with an FFT!  Doesn't bother the folks working in signal processing.

It probably won't surprise you that I considered writing an FFT extension
module at one point :-).  

> > Tim, this isn't true.  Any time you need to validate user input
> > against a controlled vocabulary and give feedback on probable right
> > choices,
> 
> Which is something few apps need anyway

I fundamentally disagree.  Few application designers *know* they need
it, but user interfaces would get a hell of a lot better if the
technique were more commonly applied -- and that's why I want it in
the Python library, so doing the right thing in Python will be a
minimum-effort proposition.
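
A small sketch of the use case: check input against a controlled vocabulary
and, on a miss, offer the closest legal choices.  It leans on
difflib.get_close_matches from today's standard library, whose matcher is
documented as similar to Ratcliff/Obershelp; that is a stand-in for the
comparator under discussion, not the module proposed here.  The helper name
check_choice and its return shape are assumptions.

```python
import difflib


def check_choice(user_input, vocabulary, max_suggestions=3):
    """Return (is_valid, suggestions) for one user-supplied word.

    Hypothetical helper for illustration: suggestions is empty when
    the input is valid or when nothing in the vocabulary is close.
    """
    if user_input in vocabulary:
        return True, []
    # get_close_matches ranks candidates by similarity ratio and
    # drops anything below its default cutoff of 0.6.
    suggestions = difflib.get_close_matches(
        user_input, vocabulary, n=max_suggestions)
    return False, suggestions


if __name__ == "__main__":
    words = ["ape", "apple", "peach", "puppy"]
    print(check_choice("appel", words))
```

An interface built on this can answer a typo with "did you mean ...?"
instead of a bare rejection.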
-- 
		<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

What if you were an idiot, and what if you were a member of Congress?
But I repeat myself.
        -- Mark Twain