[Python-Dev] Why is soundex marked obsolete?

Eric S. Raymond esr@thyrsus.com
Sun, 14 Jan 2001 02:08:57 -0500


Tim Peters <tim.one@home.com>:
> All agreed, and it should be a straightforward task then.  I'm assuming it
> will work with Unicode strings too <wink>.

Thought about that.  Want to get it working for 8 bits first.
 
> Guido will depart from you at a different point.  I depart here:  it's not
> "the right thing".  It's a bunch of hacks that appeal not because they solve
> a problem, but because they're cute algorithms that are pretty easy to
> implement and kinda solve part of a problem.

Again, my experience says differently.  I have actually *used*
Ratcliff-Obershelp to implement Do What I Mean (actually, Tell Me What
I Mean) -- and had it work very well for non-geek users.  That's why I
want other Python programmers to have easy access to the capability.

> Working six years in commercial speech recog really hammered that home to
> me:  95% solutions are on the margin of unsellable, because an error one try
> in 20 is intolerable for real people.  Developers writing for developers get
> "whoa! cool!" where my sisters walk away going "what good is that?".  Edit
> distance doesn't get within screaming range of 95% in real life.

I suspect your speech recognition experience has given you an
unhelpful bias.  For English, what you say is certainly true -- but
that's a gross worst-case application of R/O and Levenshtein that I'm
not interested in pursuing.  Nor do I expect Python hackers to use
my module for that.

Where techniques like Ratcliff-Obershelp really shine (and what I
expect the module to be used for) is with controlled vocabularies such
as command interfaces.  These tend to have better orthogonality than
NL, so antinoise filtering by R/O or Levenshtein distance (a kindred
technique I somehow didn't learn until today -- there are
disadvantages to being an autodidact) can really go to town on them.
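
A minimal sketch of that controlled-vocabulary use case, assuming
present-day Python. It uses the standard library's difflib, whose
SequenceMatcher is a close relative of Ratcliff-Obershelp; it is not
the module discussed in this thread. The command list and the name
suggest() are hypothetical, chosen only for illustration.

```python
import difflib

# Hypothetical controlled vocabulary: the commands an interface accepts.
COMMANDS = ["commit", "checkout", "status", "branch", "merge", "rebase"]

def suggest(word, vocabulary=COMMANDS, cutoff=0.6):
    """Return up to three vocabulary entries resembling `word`, best first.

    Similarity is difflib's SequenceMatcher ratio (0.0 to 1.0); entries
    scoring below `cutoff` are dropped, so an empty list means "no idea".
    """
    return difflib.get_close_matches(word, vocabulary, n=3, cutoff=cutoff)

if __name__ == "__main__":
    for typed in ("comit", "stauts", "xyzzy"):
        print(typed, "->", suggest(typed))
```

With a small, well-separated vocabulary like this one, a mistyped
command tends to score far above every wrong candidate, which is the
property the paragraph above relies on.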

(Actually, my gut feeling, after thinking hard about both algorithms,
is that R/O is still a better technique than Levenshtein for the kind
of application I have in mind.  But I also suspect the difference is
marginal.)
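
For comparison, Levenshtein distance can be sketched in a few lines.
This is the textbook dynamic-programming formulation, shown only to
illustrate the technique; the function name is an assumption, not
anything taken from the module under discussion.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string `a` into string `b`."""
    # previous[j] holds the distance between the prefix of `a` processed
    # so far and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

if __name__ == "__main__":
    print(levenshtein("comit", "commit"))
    print(levenshtein("kitten", "sitting"))
```

Unlike a similarity ratio, this is a raw edit count, so comparing
candidates of very different lengths usually calls for normalizing by
length first.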

(Other good uses for algorithms in this class include cladistics and
genomic analysis.)

> Even for most developers, it would be better to package up the single best
> approach you've got (f(list, word) -> list of possible matches sorted in
> confidence order), instead of a module with 6 (or so) functions they don't
> understand and a pile of equally mysterious knobs.

That's why good documentation, with motivating usage hints, is important.
I write good documentation, Tim.

>     PATTERN RECOGNITION OF STRINGS WITH SUBSTITUTIONS, INSERTIONS,
>     DELETIONS AND GENERALIZED TRANSPOSITIONS
>     B. J. Oommen and R. K. S. Loke
>     http://www.scs.carleton.ca/~oommen/papers/GnTrnsJ2.PDF

Thanks for the pointer; I've downloaded it and will read it.  If the
description of Oommen's algorithm is good enough, I'll implement it and
add it to the module.
-- 
		Eric S. Raymond <http://www.tuxedo.org/~esr/>

Power concedes nothing without a demand. It never did, and it never will.
Find out just what people will submit to, and you have found out the exact
amount of injustice and wrong which will be imposed upon them; and these will
continue until they are resisted with either words or blows, or with both.
The limits of tyrants are prescribed by the endurance of those whom they
oppress.
	-- Frederick Douglass, August 4, 1857