[Python-Dev] Why is soundex marked obsolete?

Tim Peters tim.one@home.com
Sat, 13 Jan 2001 23:39:59 -0500


[Eric, on "edit distance"]
> I found some formal descriptions of the algorithm and some
> unencumbered Oberon source.  I'm coding up C now.  It's not
> complicated if you're willing to hold the cost matrix in memory,
> which is reasonable for a string comparator in a way it wouldn't
> be for a file diff.

All agreed, and it should be a straightforward task then.  I'm assuming it
will work with Unicode strings too <wink>.
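
Just so we're staring at the same animal, here's a quick Python sketch
of the scheme (the name and details are mine for illustration, not
Eric's C code):  it holds the full (len(a)+1) x (len(b)+1) cost matrix
in memory, and takes the three weights he mentions below as arguments.

    def edit_distance(a, b, insert_cost=1, delete_cost=1, replace_cost=1):
        # cost[i][j] is the cheapest way to turn a[:i] into b[:j]
        m, n = len(a), len(b)
        cost = [[0] * (n + 1) for i in range(m + 1)]
        for i in range(1, m + 1):
            cost[i][0] = i * delete_cost
        for j in range(1, n + 1):
            cost[0][j] = j * insert_cost
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if a[i-1] == b[j-1]:
                    change = 0
                else:
                    change = replace_cost
                cost[i][j] = min(cost[i-1][j] + delete_cost,   # drop a[i-1]
                                 cost[i][j-1] + insert_cost,   # add b[j-1]
                                 cost[i-1][j-1] + change)      # replace/keep
        return cost[m][n]

Nothing in it cares whether a and b are 8-bit or Unicode strings, which
is why I'm not worried on that score.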

[on differing weights]
> Which about collapse into one if your function has three weight
> arguments for insert/replace/delete weights, as mine does.  It don't
> get more general than that -- I can see that by looking at the formal
> description.
>
> OK, so I'll give you that I don't weight transpositions separately,
> but neither does any other variant I found on the web nor the formal
> descriptions.  A fourth optional weight agument someday, maybe :-).
> ...
> and that's why I want it in the Python library, so doing the right
> thing in Python will be a minimum-effort proposition.

Guido will depart from you at a different point.  I depart here:  it's not
"the right thing".  It's a bunch of hacks that appeal not because they solve
a problem, but because they're cute algorithms that are pretty easy to
implement and kinda solve part of a problem.  "The right thing" -- which
you can buy -- at least involves capturing a large base of knowledge about
phonetics and spelling.  In high school, one of my buddies was Dan
Pryzbylski.  If anyone who knew him (other than me <wink>) were to type his
name into the class reunion guy's web page, they'd probably spell it the way
they remember him pronouncing it:  sha-bill-skey (and that's how he
pronounced "Dan" <wink>).  If that hit on the text string "Pryzbylski",
*then* it would be "the right thing" in a way that makes sense to real
people, not just to implementers.

Working six years in commercial speech recog really hammered that home to
me:  95% solutions are on the margin of unsellable, because an error one try
in 20 is intolerable for real people.  Developers writing for developers get
"whoa! cool!" where my sisters walk away going "what good is that?".  Edit
distance doesn't get within screaming range of 95% in real life.

Even for most developers, it would be better to package up the single best
approach you've got (f(list, word) -> list of possible matches sorted in
confidence order), instead of a module with 6 (or so) functions they don't
understand and a pile of equally mysterious knobs.  Then it may actually get
used!  Developers of the breed who would actually take the time to
understand what you've done are, I suggest, similar to us:  they'd skim the
docs, ignore the code, and write their own variations.  Or, IOW:

> so doing the right thing in Python will be a minimum-effort
> proposition.

Make someone think first, and 95% of developers will just skip over it too.
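
Concretely, the single entry point I have in mind looks more like this
(the name, the cutoff, and the crude scoring that treats a smaller
distance as higher confidence are all just for illustration; it leans
on the edit_distance sketch above, not on anything in Eric's module):

    def close_matches(word, candidates, max_distance=2):
        # score every candidate, keep the plausible ones, best first
        scored = []
        for cand in candidates:
            d = edit_distance(word, cand)
            if d <= max_distance:
                scored.append((d, cand))
        scored.sort()
        return [cand for d, cand in scored]

One function, no knobs, and something a developer can call in the two
minutes they're willing to spend on the problem.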

BTW, the theoretical literature ignored transposition at first, because it
didn't fit well in the machinery.  IIRC, I first read about it in an issue
of SP&E (Software Practice & Experience), where the authors were forced into
it because the "traditional" edit sequence measure sucked in their practice.
They were much happier after taking transposition into account.  The
theoreticians have more than caught up since, and research is still active;
e.g., 1997's

    PATTERN RECOGNITION OF STRINGS WITH SUBSTITUTIONS, INSERTIONS,
    DELETIONS AND GENERALIZED TRANSPOSITIONS
    B. J. Oommen and R. K. S. Loke
    http://www.scs.carleton.ca/~oommen/papers/GnTrnsJ2.PDF

is a good read.  As they say there,

    If one views the elements of the confusion matrices as
    probabilities, this [treating each character independent
    of all others, as "edit distance" does] is equivalent to
    assuming that the transformation probabilities at each
    position in the string are statistically independent and
    possess first-order Markovian characteristics. This model
    is usually assumed for simplicity rather it [sic] having
    any statistical significance.

IOW, because it's easy to analyze, not because it solves a real problem --
and they're complaining about an earlier generalization of edit distance
that makes the weights depend on the individual symbols involved as well as
on the insert/replace/delete distinction (another variation trying to make this
approach genuinely useful in real life).  The Oommen-Loke algorithm appears
much more realistic, taking into account the observed probabilities of
mistyping specific letter pairs (although it still ignores phonetics), and
they report accuracies approaching 98% in correctly identifying mangled
words.

98% (more than twice as good as 95% -- the error rate is actually more
useful to think about, 2% vs 5%) is truly useful for non-geek end users, and
the state of the art here is far beyond what's easy to find and dead easy to
implement.
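
For reference, the restricted (Damerau-style) transposition the SP&E
authors wanted amounts to one extra comparison in the inner loop of the
earlier sketch, plus a fourth weight.  Again this is just an
illustration, and it is *not* the Oommen-Loke algorithm, which also
weights edits by per-symbol confusion probabilities:

    def edit_distance_dl(a, b, insert_cost=1, delete_cost=1,
                         replace_cost=1, transpose_cost=1):
        # same skeleton as the earlier sketch, plus transpositions
        m, n = len(a), len(b)
        cost = [[0] * (n + 1) for i in range(m + 1)]
        for i in range(1, m + 1):
            cost[i][0] = i * delete_cost
        for j in range(1, n + 1):
            cost[0][j] = j * insert_cost
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if a[i-1] == b[j-1]:
                    change = 0
                else:
                    change = replace_cost
                best = min(cost[i-1][j] + delete_cost,
                           cost[i][j-1] + insert_cost,
                           cost[i-1][j-1] + change)
                # the new case:  adjacent "xy" <-> "yx" is one edit
                if (i > 1 and j > 1 and
                        a[i-1] == b[j-2] and a[i-2] == b[j-1]):
                    best = min(best, cost[i-2][j-2] + transpose_cost)
                cost[i][j] = best
        return cost[m][n]

Still nowhere near 98% territory:  it's the same character-at-a-time
model, just with one more kind of slip it's willing to forgive.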

> ...
> It probably won't surprise you that I considered writing an FFT
> extension module at one point :-).

Nope!  More power to you, Eric.  At least FFTs *are* state of the art,
although *coding* them optimally is likely beyond human ability on modern
machines:

    http://www.fftw.org/

(short course:  they've generally got the fastest FFTs available, and their
code is generated by a program, systematically *trying* every trick in the
book, timing it on a given box, and synthesizing a complete strategy out of
the quickest pieces).

sooner-or-later-the-only-code-real-people-will-use-won't-be-written-
    by-people-at-all-ly y'rs  - tim