[Python-Dev] Why is soundex marked obsolete?

Eric S. Raymond esr@thyrsus.com
Sun, 14 Jan 2001 07:15:33 -0500


Tim Peters <tim.one@home.com>:
> Yet the narrower the domain, the less call for a library with multiple
> approaches.  If R-O really shone for you, why bother with anything else?

Well, I was bothering with Levenshtein because *you* suggested it. :-)

I put in Hamming similarity and stemming because they're O(n) where
R/O is quadratic, and both widely used in situations where a fast sloppy
job is preferable to a good but slow one.  My documentation page is explicit
about the tradeoff.

> Seriously.  You haven't used some (most?) of these. 

I've used stemming and R-O.  Haven't used Hamming or Levenshtein.

>                                   The core isn't a place
> for research modules either (note that I have no objection whatsoever to
> writing any module you like -- the only question here is what belongs in the
> core, and any algorithm *nobody* here has experience with in your target
> domain is plainly a poor *core* candidate for that reason alone -- we have
> to maintain, justify and explain it for years to come).

Fair point.  I read it, in this context, as good advice to drop the Hamming 
entry point and forget about the Levenshtein implementation -- stick to what
I've used and know is useful as opposed to what I think might be useful.

>                                                I may well agree you don't
> need all that heavy machinery if I had a clear definition of what problem it
> is you're trying to solve (I've learned it's not the kinds of problems *I*
> had in mind when I first read your description!).

I think you have it by now, judging by the following...

> What happens when the user doesn't enter an exact match?  Does the kind of
> app you have in mind then just present them with a list of choices? 

Yes.  I've used this technique a lot.  It gives users not just guidance 
but warm fuzzy feelings -- they react as though there's a friendly 
homunculus inside the software looking out for them.  Actually, in my
experience, the less techie they are the more they like this.

> If that's all (as opposed to, e.g., substituting its best guess for what the
> user actually typed and proceeding as if the user had given that from the
> start), then the evidence from studies says users are almost as pleased when
> the correct choice appears somewhere in the first three choices as when it
> appears as *the* top choice.

Interesting.  That does fit what I've seen.

>                    A well-designed vocabulary can almost
> guarantee that happy result (note that most of the current research is aimed
> at the much harder job of getting the intended word into the #1 slot on the
> choice list).

Yes.  One of my other tricks is to design command vocabularies so the
first three characters close to unique.  This means R/O will almost
always nail the right thing.

> Much as in typing, some mutations are more likely than others for *physical*
> reasons, so treating all pairs of symbols in the alphabet alike is too gross
> a simplification.).

Indeed.  Couple weeks ago I was a speaker at a conference called "After the
Genome 6" at which one of the most interesting papers was given by a lady
mathematician who designs algorithms for DNA sequence matching.  She made
exactly this point.

> > That's why good documentation, with motivating usage hints, is
> > important.  I write good documentation, Tim.
> 
> You're not going to find offense here even if you look for it, Eric <wink>:

No worries, I wasn't looking. :-)

> Most people will be looking for "a solution", not for "a toolkit".  If the
> docs read like a toolkit, it doesn't matter how good they are, the bulk of
> the people you're trying to reach will pass on it.  If you really want this
> to be *used*, supply one class that does *all* the work, including making
> the expert-level choices of which algorithm is used under the covers and how
> it's tuned.  That's good advice.

I don't think that's possible in this case -- the proper domains for
stemming and R-O are too different.  But maybe this is another nudge to drop
the Hamming code.

>       But to get it under his radar, it's again much easier if the usage
> docs are no longer than a couple paragraphs.

How's this?

\section{\module{simil} -- 
         String similarily metrics}

\declaremodule{standard}{simil}
\moduleauthor{Eric S. Raymond}{esr@thyrsus.com}
\modulesynopsis{String similarity metrics.}

\sectionauthor{Eric S. Raymond}

The \module{simil} module provides similarity functions for
approximate word or string matching.  One important application is for
checking input words against a dictionary to match possible
misspellings with the right terms in a controlled vocabulary.

The entry points provide different tradeoffs ranging from crude and
fast (stemming) to effective but slow (Ratcliff-Obershelp gestalt
subpattern matching).  The latter is one of the standard techniques
used in commercial OCR software.

The \module{simil} module defines the following functions:

\begin{funcdesc}{stem}{}
Returns the length of the longest common prefix of two strings divided
by the length of the longer.  Similarity scores range from 0.0 (no
common prefix) to 1.0 (identity).  Running time is linear in string
length.
\end{funcdesc}

\begin{funcdesc}{hamming}{}
Computes a normalized Hamming similarity between two strings of equal
length -- the number of pairwise matches in the strings, divided by
their common length.  It returns None if the strings are of unequal
length.  Similarity scores range from 0.0 (no positions equal) to 1.0
(identity).  Running time is linear in string length.
\end{funcdesc}

\begin{funcdesc}{ratcliff}{}
Returns a Ratcliff/Obershelp gestalt similarity score based on
co-occurrence of subpatterns.  Similarity scores range from 0.0 (no
common subpatterns) to 1.0 (identity).  Running time is best-case
linear, worst-case quadratic in string length.
\end{funcdesc}

> Module name? http
> Hmm.  My best guesses are httplib, tty, stat
> 
> [I was thinking of httplib, but note that it missed
>  SimpleHTTPServer:  a name that long just isn't going to score
>  high when the input is that short]

>>> simil.ratcliff("http", "httplib")
0.72727274894714355
>>> simil.ratcliff("http", "tty")
0.57142859697341919
>>> simil.ratcliff("http", "stat")
0.5
>>> simil.ratcliff("http", "simplehttpserver")
0.40000000596046448

So with the 0.6 threshold I normally use R-O does better at eliminating
the false matches but doesn't catch SimpleHTTPServer (case is, I'm
sure you'll agree, an irrelevant detail here).
 
> Module name? dictionary
> Hmm.  My best guesses are Bastion, ConfigParser, tabnanny
> 
> [darn, I *think* I was thinking of UserDict there]

>>> simil.ratcliff("dictionary", "bastion")
0.47058823704719543
>>> simil.ratcliff("dictionary", "configparser")
0.45454546809196472
>>> simil.ratcliff("dictionary", "tabnanny")
0.4444444477558136
>>> simil.ratcliff("dictionary", "userdict")
0.4444444477558136

R-O would have booted all of these.  Hiighest score to configparser.
Interesting -- I'm beginning to think R-O overweights lots of small
subpattern matches relative to a few big ones, something I didn't notice
before because the statistics of my vocabularies masked it.

> Module name? uuencode
> Hmm.  My best guesses are code, codeop, codecs

>>> simil.ratcliff("uuencode", "code")
0.66666668653488159
>>> simil.ratcliff("uuencode", "codeops")
0.53333336114883423
>>> simil.ratcliff("uuencode", "codecs")
0.57142859697341919
>>> simil.ratcliff("uuencode", "uu")
0.40000000596046448

R-O would pick "code" and boot the rest.

> [Missed uu]
> 
> Module name? parse
> Hmm.  My best guesses are tzparse, urlparse, pre

>>> simil.ratcliff("parse", "tzparse")
0.83333331346511841
>>> simil.ratcliff("parse", "urlparse")
0.76923078298568726
>>> simil.ratcliff("parse", "pre")
0.75

Same result.

> Module name? browser
> Hmm.  My best guesses are webbrowser, robotparser, user

>>> simil.ratcliff("browser", "webbrowser")
0.82352942228317261
>>> simil.ratcliff("browser", "robotparser")
0.55555558204650879
>>> simil.ratcliff("browser", "user")
0.54545456171035767

Big win for R-O.  Picks the right one, boots the wrong two.

> Module name? brower
> Hmm.  My best guesses are webbrowser, repr, reconvert

>>> simil.ratcliff("brower", "webbrowser")
0.75
>>> simil.ratcliff("brower", "repr")
0.60000002384185791
>>> simil.ratcliff("brower", "reconvert")
0.53333336114883423

Small win for R/O -- boots reconvert, and repr squeaks in under the wire.

> Module name? Thread
> Hmm.  My best guesses are threading, whrandom, sched

>>> simil.ratcliff("thread", "threading")
0.80000001192092896
>>> simil.ratcliff("thread", "whrandom")
0.57142859697341919
>>> simil.ratcliff("thread", "sched")
0.54545456171035767

Big win for R-O.

> Module name? pickle
> Hmm.  My best guesses are pickle, profile, tempfile

>>> simil.ratcliff("pickle", "pickle")
1.0
>>> simil.ratcliff("pickle", "profile")
0.61538463830947876
>>> simil.ratcliff("pickle", "tempfile")
0.57142859697341919

R-O wins again.

> (BTW, the first choice was an exact match)
> Module name? shelf
> Hmm.  My best guesses are shelve, shlex, sched

>>> simil.ratcliff("shelf", "shelve")
0.72727274894714355
>>> simil.ratcliff("shelf", "shlex")
0.60000002384185791
>>> simil.ratcliff("shelf", "sched")
0.60000002384185791

Interesting.  Shelve scoores highest, both the others squeak in.

> Module name? katmandu
> Hmm.  My best guesses are commands, random, anydbm
>
> [I really was thinking of "commands"!]

>>> simil.ratcliff("commands", "commands")
1.0
>>> simil.ratcliff("commands", "random")
0.4285714328289032
>>> simil.ratcliff("commands", "anydbm")
0.4285714328289032

R-O wins big.
 
> Module name? temporary
> Hmm.  My best guesses are tzparse, tempfile, fpformat

>>> simil.ratcliff("temporary", "tzparse")
0.5
>>> simil.ratcliff("temporary", "tempfile")
0.47058823704719543
>>> simil.ratcliff("temporary", "fpformat")
0.47058823704719543

R-O boots all of these.  

> Hard question:  is that "good enough" for what you want?

Um...notice that R-O filtering, even though it seems to be
underweighting large matches, did a rather better job on your examples!
With an 0.66 threshold it would have done *much* better.

I think you've just made an argument for replacing your SequenceMatcher
with simil.ratcliff.  Mine's even documented. :-).
-- 
		<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

Militias, when properly formed, are in fact the people themselves and
include all men capable of bearing arms. [...] To preserve liberty it is
essential that the whole body of the people always possess arms and be
taught alike, especially when young, how to use them.
        -- Senator Richard Henry Lee, 1788, on "militia" in the 2nd Amendment