[Python-Dev] Why is soundex marked obsolete?

Tim Peters tim.one@home.com
Sun, 14 Jan 2001 14:46:44 -0500


[M.-A. Lemburg]
> BTW, are there less English centric "sounds alike" matchers
> around ?

Yes, but if anything there are far too many of them:  like Soundex, they're
just heuristics, and *everybody* who cares adds their own unique twists,
while proper studies are almost non-existent.  Few variants appear to be in
use much beyond their inventors' friends.  One notable exception is the
Daitch-Mokotoff variation, originally tailored to the needs of the Jewish
community but later generalized; a brief description is here:

    http://www.avotaynu.com/soundex.html

The similarly involved NYSIIS algorithm (New York State Identification and
Intelligence System -- look for NYSIIS on Parnassus) was the winner from a
field of about two dozen competing algorithms, after measuring their
effectiveness on assorted databases maintained by the state of New York.
Since New York has a large immigrant population, NYSIIS isn't as
Anglocentric as Soundex either.

But the state of the art has given up on purely computational algorithms for
these purposes:  proper names are simply too much of a mess.  For example, if I
search for "Richard", it *ought* to match on "Dick"; if my Arab buddy
searches on "Mohammed", it *ought* to match on "Mhd"; "the rules" people
actually use just aren't reducible to pure computation -- it takes a large
knowledge base to capture what people "just know".  You may enjoy visiting
this commercial site (AFAIK, nobody is giving away state-of-the-art for
free):

    http://www.las-inc.com/
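
To make the "Richard"/"Dick" point concrete, here is a minimal sketch of the
common American Soundex rules in Python, plus a toy lookup table standing in
for the "large knowledge base".  The names soundex, NICKNAMES and sounds_alike
are made up for illustration, and the two-entry table is an assumption, not
anything a real product ships:

```python
# Letter groups of the common American Soundex variant.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code for name ("" if no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue            # H and W do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code             # a vowel resets prev, so repeats count again
    return (result + "000")[:4]


# Hypothetical, hand-made stand-in for a real name knowledge base.
NICKNAMES = {"dick": "richard", "mhd": "mohammed"}


def sounds_alike(a, b):
    """Compare Soundex codes after mapping known short forms to full names."""
    a = NICKNAMES.get(a.lower(), a)
    b = NICKNAMES.get(b.lower(), b)
    return soundex(a) == soundex(b)


print(soundex("Richard"), soundex("Dick"))    # R263 D200
print(soundex("Mohammed"), soundex("Mhd"))    # M530 M300
print(sounds_alike("Richard", "Dick"))        # True, but only via the table
```

The pure algorithm gets "Robert" and "Rupert" to collide (both R163), which is
what it was built for, but no amount of letter-shuffling gets from "Dick" to
"Richard":  that match comes entirely from the table.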

> ...
>     http://physics.nist.gov/cuu/Reference/soundex.html
>
> works fine for English texts,

If that were true, the English-speaking researchers would have declared
victory 120 years ago <wink>.  But English pronunciation is *notoriously*
difficult to predict from spelling, partly because English is the Perl of
human languages.

or-maybe-the-borg-assuming-there's-a-difference<wink>-ly y'rs  - tim