[Python-Dev] Why is soundex marked obsolete?
M.-A. Lemburg
mal@lemburg.com
Mon, 15 Jan 2001 12:56:37 +0100
Tim Peters wrote:
>
> [M.-A. Lemburg]
> > BTW, are there less English-centric "sounds alike" matchers
> > around?
>
> Yes, but if anything there are far too many of them: like Soundex, they're
> just heuristics, and *everybody* who cares adds their own unique twists,
> while proper studies are almost non-existent. Few variants appear to be in
> use much beyond their inventor's friends; one notable exception in the
> Jewish community is the Daitch-Mokotoff variation, originally tailored to
> their unique needs but later generalized; a brief description here:
>
> http://www.avotaynu.com/soundex.html
>
> The similarly involved NYSIIS algorithm (New York State Identification
> Intelligence System -- look for NYSIIS on Parnassus) was the winner from a
> field of about two dozen competing algorithms, after measuring their
> effectiveness on assorted databases maintained by the state of New York.
> Since New York has a large immigrant population, NYSIIS isn't as
> Anglocentric as Soundex either.
Thanks for the pointer. I'll add that module to my lib :)
http://metagram.webreply.com/downloads/nysiis.py
Perhaps Eric ought to add this one to his package as well ?!
BTW, where can I find your package on the web, Eric ? I'd like
to give it a ride under German language conditions ;)
> But state-of-the-art has given up on purely computational algorithms for
> these purposes: proper names are simply too much a mess. For example, if I
> search for "Richard", it *ought* to match on "Dick"; if my Arab buddy
> searches on "Mohammed", it *ought* to match on "Mhd"; "the rules" people
> actually use just aren't reducible to pure computation -- it takes a large
> knowledge base to capture what people "just know". You may enjoy visiting
> this commercial site (AFAIK, nobody is giving away state-of-the-art for
> free):
>
> http://www.las-inc.com/
Sad -- "patent pending" algorithms don't help anyone on this
planet :(
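
To make the "knowledge base" point concrete, here is a minimal sketch of how a table lookup handles pairs that no phonetic code will ever match. The table contents and the names ALIASES, build_index and same_name are made up for illustration; only the Richard/Dick and Mohammed/Mhd pairs come from the text above, and a real system would need a vastly larger table.

```python
# Hypothetical alias table: canonical name -> known variants.
# The two entries are the examples quoted above; everything else
# about this structure is an assumption for illustration.
ALIASES = {
    "richard": {"dick"},
    "mohammed": {"mhd"},
}


def build_index(aliases):
    """Map every known spelling (canonical or variant) to its canonical name."""
    index = {}
    for canonical, variants in aliases.items():
        index[canonical] = canonical
        for variant in variants:
            index[variant] = canonical
    return index


INDEX = build_index(ALIASES)


def same_name(a, b, index=INDEX):
    """True if both names resolve to the same canonical entry.

    Names missing from the table fall back to a case-insensitive
    exact comparison.
    """
    key_a = index.get(a.lower(), a.lower())
    key_b = index.get(b.lower(), b.lower())
    return key_a == key_b


print(same_name("Richard", "Dick"))
print(same_name("Mohammed", "Mhd"))
print(same_name("Richard", "Mhd"))
```

The lookup itself is trivial; the hard (and commercially valuable) part is collecting the table.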
> > ...
> > http://physics.nist.gov/cuu/Reference/soundex.html
> >
> > works fine for English texts,
>
> If that were true, the English-speaking researchers would have declared
> victory 120 years ago <wink>. But English pronunciation is *notoriously*
> difficult to predict from spelling, partly because English is the Perl of
> human languages.
Then Dutch must be the Python of human languages... ;)
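
For reference, a rough sketch of the classic American Soundex rules as they are commonly described (keep the first letter, code the remaining consonants as digits, collapse runs of the same code, pad to four characters). This is not the code from the obsolete soundex module or from any package mentioned in this thread; the function name and the handling of non-ASCII letters (they are simply dropped) are my own assumptions.

```python
# Letter groups for codes 1..6; vowels, Y, H and W have no code.
CODES = {}
for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name, length=4):
    """Return a Soundex-style code for name (sketch, ASCII letters only)."""
    # Assumption: anything outside A-Z (umlauts, digits, punctuation) is dropped.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    result = [first]
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result.append(code)
        # Vowels reset prev to "", so equal codes separated by a vowel
        # are both kept.
        prev = code
    return "".join(result)[:length].ljust(length, "0")


print(soundex("Robert"), soundex("Rupert"))   # same code
print(soundex("Meier"), soundex("Mayer"))     # same code
print(soundex("Knight"), soundex("Night"))    # different codes
```

The last pair shows the sort of thing that keeps it from "working fine" even for English: the first letter is kept verbatim, so a silent leading consonant defeats the match.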
--
Marc-Andre Lemburg
______________________________________________________________________
Company: http://www.egenix.com/
Consulting: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/