[Python-Dev] Why is soundex marked obsolete?

Tim Peters tim.one@home.com
Sun, 14 Jan 2001 14:00:21 -0500


Very quick (swamped):

> I think you've just made an argument for replacing your
> SequenceMatcher with simil.ratcliff.

Actually, I'm certain they're the same algorithm now, except the C is
showing through in ratcliff to the floating-point eye <wink>.  For
demonstration, I *always* printed the top three scorers (that's logic in the
little driver I posted, not in SequenceMatcher), without any notion of
cutoff (ndiff does use a cutoff).  Add this line before the return (in the
posted driver) to see the actual scores:

    print scores[:numchoices]

For example:

Module name? browser
[(0.82352941176470584, 'webbrowser'),
 (0.55555555555555558, 'robotparser'),
 (0.54545454545454541, 'user')]
Hmm.  My best guesses are webbrowser, robotparser, user
Module name?
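
For anyone without the driver at hand, here is a minimal sketch of that kind
of top-three ranking, in present-day Python 3 syntax.  It assumes
SequenceMatcher as shipped in the standard difflib module; the function name
top_matches and the short candidate list are made up for illustration and are
not the posted driver.

```python
from difflib import SequenceMatcher

def top_matches(query, candidates, numchoices=3):
    """Return the numchoices best (score, name) pairs, best first; no cutoff."""
    scores = [(SequenceMatcher(None, query, name).ratio(), name)
              for name in candidates]
    scores.sort(reverse=True)   # highest similarity first
    return scores[:numchoices]

# Hypothetical candidate list; the real driver looked at actual module names.
names = ["string", "user", "robotparser", "webbrowser"]
for score, name in top_matches("browser", names):
    print(score, name)
```

Like the demonstration above, this always returns the top scorers however
poor they are; applying a cutoff would be a separate step.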

On this example you reported:

>>> simil.ratcliff("browser", "webbrowser")
0.82352942228317261
>>> simil.ratcliff("browser", "robotparser")
0.55555558204650879
>>> simil.ratcliff("browser", "user")
0.54545456171035767

which strongly suggests you're using C floats instead of Python floats to
compute the final score.  I didn't try every example in your email, but it's
the same story on the three I did try (scores identical modulo
simil.ratcliff dropping about 30 of the low-order result bits -- which is
about the difference between a C double, with 53 significand bits, and a C
float, with 24, on most boxes).
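
The float-versus-double effect is easy to reproduce.  The sketch below
assumes IEEE-754 arithmetic and uses the standard struct module to squeeze a
Python float (a C double) through a 4-byte C float and back; the helper name
as_c_float is invented for the illustration.

```python
import struct

def as_c_float(x):
    """Round a Python float (C double) to the nearest 4-byte float and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# 2*7/17 is the score for "browser" vs "webbrowser":
# 7 matching characters out of 17 characters in total.
score = 2.0 * 7 / 17
print(repr(score))              # full double precision
print(repr(as_c_float(score)))  # after a round trip through a C float
```

The second value should agree with the simil.ratcliff figure quoted above to
every digit a C float can hold, and differ from the first from about the
eighth significant digit on.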

> Mine's even documented. :-).

Which I appreciate!  I dreamt up the SequenceMatcher algorithm going on 20
years ago for a friendly diff generator, and never even considered using it
for other purposes.  But then I may have mentioned that these other purposes
never come up in my apps <wink>.

or-at-least-they-haven't-in-contexts-where-r/o-would-have-been-
    strong-enough-ly y'rs  - tim