[Python-Dev] Why is soundex marked obsolete?
Eric S. Raymond
esr@thyrsus.com
Sun, 14 Jan 2001 14:09:01 -0500
Tim Peters <tim.one@home.com>:
> > I think you've just made an argument for replacing your
> > SequenceMatcher with simil.ratcliff.
>
> Actually, I'm certain they're the same algorithm now, except the C is
> showing through in ratcliff to the floating-point eye <wink>.
Take a look:
/*****************************************************************************
*
* Ratcliff-Obershelp common-subpattern similarity.
*
* This code first appeared in a letter to the editor in Doctor
* Dobbs's Journal, 11/1988. The original article on the algorithm,
* "Pattern Matching by Gestalt" by John Ratcliff, had appeared in the
* July 1988 issue (#181) but the algorithm was presented in assembly.
* The main drawback of the Ratcliff-Obershelp algorithm is the cost
* of the pairwise comparisons. It is significantly more expensive
* than stemming, Hamming distance, soundex, and the like.
*
* Running time quadratic in the data size, memory usage constant.
*
*****************************************************************************/
static int RatcliffObershelp(char *st1, char *end1, char *st2, char *end2)
{
register char *a1, *a2;
char *b1, *b2;
char *s1 = st1, *s2 = st2; /* initializations are just to pacify GCC */
short max, i;
if (end1 <= st1 || end2 <= st2)
return(0);
if (end1 == st1 + 1 && end2 == st2 + 1)
return(0);
max = 0;
b1 = end1; b2 = end2;
for (a1 = st1; a1 < b1; a1++)
{
for (a2 = st2; a2 < b2; a2++)
{
if (*a1 == *a2)
{
/* determine length of common substring */
for (i = 1; a1[i] && (a1[i] == a2[i]); i++)
continue;
if (i > max)
{
max = i; s1 = a1; s2 = a2;
b1 = end1 - max; b2 = end2 - max;
}
}
}
}
if (!max)
return(0);
max += RatcliffObershelp(s1 + max, end1, s2 + max, end2); /* rhs */
max += RatcliffObershelp(st1, s1, st2, s2); /* lhs */
return max;
}
static float ratcliff(char *s1, char *s2)
/* compute Ratcliff-Obershelp similarity of two strings */
{
short l1, l2;
l1 = strlen(s1);
l2 = strlen(s2);
/* exact match end-case */
if (l1 == 1 && l2 == 1 && *s1 == *s2)
return(1.0);
return 2.0 * RatcliffObershelp(s1, s1 + l1, s2, s2 + l2) / (l1 + l2);
}
static PyObject *
simil_ratcliff(PyObject *self, PyObject *args)
{
char *str1, *str2;
if(!PyArg_ParseTuple(args, "ss:ratcliff", &str1, &str2))
return NULL;
return Py_BuildValue("f", ratcliff(str1, str2));
}
--
<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
"Taking my gun away because I might shoot someone is like cutting my tongue
out because I might yell `Fire!' in a crowded theater."
-- Peter Venetoklis