[Python-Dev] Why is soundex marked obsolete?

Eric S. Raymond esr@thyrsus.com
Sun, 14 Jan 2001 14:09:01 -0500


Tim Peters <tim.one@home.com>:
> > I think you've just made an argument for replacing your
> > SequenceMatcher with simil.ratcliff.
> 
> Actually, I'm certain they're the same algorithm now, except the C is
> showing through in ratcliff to the floating-point eye <wink>.

Take a look:

/*****************************************************************************
 *
 * Ratcliff-Obershelp common-subpattern similarity.
 *
 * This code first appeared in a letter to the editor in Doctor
 * Dobbs's Journal, 11/1988.  The original article on the algorithm,
 * "Pattern Matching by Gestalt" by John Ratcliff, had appeared in the
 * July 1988 issue (#181) but the algorithm was presented in assembly.
 * The main drawback of the Ratcliff-Obershelp algorithm is the cost
 * of the pairwise comparisons.  It is significantly more expensive
 * than stemming, Hamming distance, soundex, and the like.
 *
 * Running time quadratic in the data size, memory usage constant.
 *
 *****************************************************************************/

static int RatcliffObershelp(char *st1, char *end1, char *st2, char *end2)
{
    register char *a1, *a2;
    char *b1, *b2; 
    char *s1 = st1, *s2 = st2;	/* initializations are just to pacify GCC */
    short max, i;

    if (end1 <= st1 || end2 <= st2)
	return(0);
    if (end1 == st1 + 1 && end2 == st2 + 1)
	return(0);
		
    max = 0;
    b1 = end1; b2 = end2;
	
    for (a1 = st1; a1 < b1; a1++)
    {
	for (a2 = st2; a2 < b2; a2++)
	{
	    if (*a1 == *a2)
	    {
		/* determine length of common substring */
		for (i = 1; a1[i] && (a1[i] == a2[i]); i++) 
		    continue;
		if (i > max)
		{
		    max = i; s1 = a1; s2 = a2;
		    b1 = end1 - max; b2 = end2 - max;
		}
	    }
	}
    }
    if (!max)
	return(0);
    max += RatcliffObershelp(s1 + max, end1, s2 + max, end2);	/* rhs */
    max += RatcliffObershelp(st1, s1, st2, s2);			/* lhs */
    return max;
}

static float ratcliff(char *s1, char *s2)
/* compute Ratcliff-Obershelp similarity of two strings */
{
    short l1, l2;

    l1 = strlen(s1);
    l2 = strlen(s2);
	
    /* exact match end-case */
    if (l1 == 1 && l2 == 1 && *s1 == *s2)
	return(1.0);
			
    return 2.0 * RatcliffObershelp(s1, s1 + l1, s2, s2 + l2) / (l1 + l2);
}

static PyObject *
simil_ratcliff(PyObject *self, PyObject *args)
{
    char *str1, *str2;
    
    if(!PyArg_ParseTuple(args, "ss:ratcliff", &str1, &str2))
        return NULL;

    return Py_BuildValue("f", ratcliff(str1, str2));
}
-- 
		<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

"Taking my gun away because I might shoot someone is like cutting my tongue
out because I might yell `Fire!' in a crowded theater."
        -- Peter Venetoklis