[Python-Dev] Why is soundex marked obsolete?
Tim Peters
tim.one@home.com
Sun, 14 Jan 2001 14:00:21 -0500
Very quick (swamped):
> I think you've just made an argument for replacing your
> SequenceMatcher with simil.ratcliff.
Actually, I'm certain they're the same algorithm now, except the C is
showing through in ratcliff to the floating-point eye <wink>. For
demonstration, I *always* printed the top three scorers (that's logic in the
little driver I posted, not in SequenceMatcher), without any notion of
cutoff (ndiff does use a cutoff). Add this line before the return (in the
posted driver) to see the actual scores:
    print scores[:numchoices]
For example:
    Module name? browser
    [(0.82352941176470584, 'webbrowser'),
     (0.55555555555555558, 'robotparser'),
     (0.54545454545454541, 'user')]
    Hmm. My best guesses are webbrowser, robotparser, user
    Module name?
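For anyone who wants to reproduce that kind of listing, here is a minimal sketch of a "top three, no cutoff" driver. It is not the driver I posted: it assumes a current Python where SequenceMatcher lives in difflib, and the candidate list and the name top_scorers are made up for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical candidate list; a real driver would scan actual module names.
CANDIDATES = ["webbrowser", "robotparser", "user", "os", "string"]

def top_scorers(word, candidates, numchoices=3):
    """Return the numchoices best (ratio, name) pairs, with no cutoff."""
    scores = [(SequenceMatcher(None, word, name).ratio(), name)
              for name in candidates]
    scores.sort(reverse=True)   # highest similarity ratio first
    return scores[:numchoices]

for score, name in top_scorers("browser", CANDIDATES):
    # 17 significant digits, to mimic the old repr of a float
    print("%.17g  %s" % (score, name))
```

The cutoff-free design is only for demonstration: every query gets three answers, however poor the third one is.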
On this example you reported:
    >>> simil.ratcliff("browser", "webbrowser")
    0.82352942228317261
    >>> simil.ratcliff("browser", "robotparser")
    0.55555558204650879
    >>> simil.ratcliff("browser", "user")
    0.54545456171035767
which strongly suggests you're using C floats instead of Python floats to
compute the final score. I didn't try every example in your email, but it's
the same story on the three I did try (scores identical modulo
simil.ratcliff dropping about 30 of the low-order result bits -- which is
about the difference between a C double and a C float on most boxes).
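The float-vs-double guess is easy to check without touching simil at all. The sketch below assumes IEEE-754 hardware (53 significand bits in a C double, 24 in a C float, so 29 bits lost) and uses the struct module to round a Python float to a C float and back. The helper name as_c_float is made up, and the fractions 14/17, 10/18 and 6/11 are my reading of the scores quoted above, not anything taken from simil's source.

```python
import struct

def as_c_float(x):
    """Round a Python float (a C double) to the nearest C float and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

# The three SequenceMatcher scores above, written as exact fractions
# (an inference from the printed digits).
for num, den in [(14, 17), (10, 18), (6, 11)]:
    exact = float(num) / den
    # left: full double precision; right: after a round trip through a C float
    print("%.17g -> %.17g" % (exact, as_c_float(exact)))
```

If the right-hand column agrees with what simil.ratcliff printed, the two algorithms agree on the match counts and differ only in the precision of the final division.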
> Mine's even documented. :-).
Which I appreciate! I dreamt up the SequenceMatcher algorithm going on 20
years ago for a friendly diff generator, and never even considered using it
for other purposes. But then I may have mentioned that these other purposes
never come up in my apps <wink>.
or-at-least-they-haven't-in-contexts-where-r/o-would-have-been-
strong-enough-ly y'rs - tim