[Python-Dev] Why is soundex marked obsolete?

Tim Peters tim.one@home.com
Sun, 14 Jan 2001 15:31:06 -0500

> Actually, I'm certain they're the same algorithm now, except the C is
> showing through in ratcliff to the floating-point eye <wink>.

> Take a look:

Yup, same thing, except:

> static float ratcliff(char *s1, char *s2)

accounts for the numeric differences (change "float"->"double" and they'd be
the same; Python has to convert it to a double anyway, lacking any internal
support for C's floats; and the C code is *computing* in double regardless,
cutting it back to a float upon return just because of the "float" decl).

The code in SequenceMatcher doesn't *look* anything like it, though, due to
years of dreaming up faster ways to do this (in its original role as a diff
generator, it routinely had to deal with sequences containing 10s of
thousands of elements, and code very much like the code you posted was just
too slow for that).

One simple trick that can enormously speed the worst cases:  the "find the
longest match starting here" innermost loop is guarded by

> 	    if (*a1 == *a2)

However, it can't possibly find a *bigger* max unless it's also the case

    a1[max) == a2[max)

That's usually false in real life, so by adding that test to the guard you
usually get to skip the innermost loop entirely.  Probably more important in
a diff-generator role, though.

SequenceMatcher's prime trick is to preprocess one of the strings, in linear
time building up a hash table mapping each character in the string to a list
of the indices at which it appears.  Then the second-innermost loop is saved
from needing to do any search:  when we get to, e.g., 'x' in the other
string, the precomputed hash table tells us directly where to find all the
x's in the original string.  And in the match-1-against-N case, this hash
table can be computed once & reused N times.  That's a monster win.

However, I never had the patience to code that in C, so I never *did* that
before I reimplemented my stuff in Python.  Now the Python ndiff runs
circles around the old Pascal and C versions.  I'm sure that has nothing to
do with machines having gotten 100x faster in the meantime <wink>>

    y'rs  - tim