soundex (revisited)
Greg Jorgensen
gregj at pobox.com
Mon Dec 25 04:55:04 EST 2000
"Daniel Klein" <DanielK at aracnet.com> wrote in message
news:Var16.263$LU6.109277 at typhoon.aracnet.com...
> After seeing the post from several days ago on soundex, I gave it a whirl to
> see if I could come up with something different (and possibly better),
> following the rules laid down by Knuth:
>
> def get_soundex(name, digits = 3):
>     soundexcodes = "01230120022455012623010202"
>     #               ABCDEFGHIJKLMNOPQRSTUVWXYZ
>     instring = name.upper()
>     soundex = instring[0]
>     last = soundex
>     instring = instring[1:]
>     for char in instring:
>         if 65 <= ord(char) <= 90:
>             sx = soundexcodes[ord(char) - 65]
>             if int(sx) and char != last:
>                 soundex += sx
>             last = char
>     if len(soundex) < (digits + 1): soundex = (soundex + ("0" * digits))
>     return soundex[:digits + 1]
I see a few problems, mainly in the handling of consecutive consonants. You
are checking for repeated characters, but the Soundex algorithm specifies
that consecutive letters with the same code be treated as a single code. Both
'mm' and 'mn' collapse to a single 5 because both 'm' and 'n' are coded as
5.
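
To make the difference concrete, here is a small sketch (not taken from either posted version) that collapses repeats once by letter and once by code. The names CODES, code_of, collapse_by_char and collapse_by_code are made up for illustration, and the input is assumed to contain only ASCII letters.

```python
# Soundex digit for each letter A..Z (same table as in the quoted code)
CODES = "01230120022455012623010202"

def code_of(ch):
    """Return the Soundex digit for one uppercase ASCII letter."""
    return CODES[ord(ch) - ord("A")]

def collapse_by_char(word):
    """Skip a letter only when it repeats the previous letter."""
    out = []
    prev = ""
    for ch in word.upper():
        if ch != prev:
            out.append(code_of(ch))
        prev = ch
    return "".join(out)

def collapse_by_code(word):
    """Skip a letter when its code repeats the previous code."""
    out = []
    for ch in word.upper():
        d = code_of(ch)
        if not out or d != out[-1]:
            out.append(d)
    return "".join(out)
```

With these, 'mn' comes out as two 5s when letters are compared and as a single 5 when codes are compared; 'mm' gives a single 5 either way.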
You can (and probably should) use the isalpha() string method to check for
alpha characters, rather than the 'magic numbers' 65 through 90. Likewise,
ord(char) - ord('A') is a bit clearer than ord(char) - 65.
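
A sketch of that check follows. It carries one caveat that is an assumption about current Python 3 rather than something in the posted code: there, isalpha() is also true for non-ASCII letters such as 'É', so an extra isascii() test (Python 3.7 or later) keeps the table index in range. letter_code and CODES are hypothetical names.

```python
CODES = "01230120022455012623010202"  # Soundex digits for A..Z

def letter_code(ch):
    """Return the Soundex digit for ch, or None if ch is not an ASCII letter."""
    ch = ch.upper()
    # the len() check guards against characters whose uppercase form is
    # longer than one character
    if len(ch) == 1 and ch.isalpha() and ch.isascii():
        # ord('A') documents the offset; no bare 65 needed
        return CODES[ord(ch) - ord("A")]
    return None
```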
Here's a version I wrote. I'm open to any criticisms, suggestions, etc. I
compared my version to the module announced here a while back (I think mine
is a lot more readable; it is certainly shorter). I also compared it to a
Perl version I found and I think my implementation is more robust and
smaller.
def soundex(name, len=4):
    """ soundex module conforming to Knuth's algorithm
        implementation 2000-12-24 by Gregory Jorgensen
        public domain
    """
    # digits holds the soundex values for the alphabet
    digits = '01230120022455012623010202'
    sndx = ''
    fc = ''
    # translate alpha chars in name to soundex digits
    for c in name.upper():
        if c.isalpha():
            if not fc: fc = c   # remember first letter
            d = digits[ord(c)-ord('A')]
            # duplicate consecutive soundex digits are skipped
            if not sndx or (d != sndx[-1]):
                sndx += d
    # replace first digit with first alpha character
    sndx = fc + sndx[1:]
    # remove all 0s from the soundex code
    sndx = sndx.replace('0','')
    # return soundex code padded to len characters
    return (sndx + (len * '0'))[:len]
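
For comparison, the same steps can be sketched with itertools.groupby doing the collapse of consecutive equal digits. This is an alternative arrangement, not the code from the post: soundex_groupby, DIGITS and the length parameter are names made up here, and non-ASCII characters are simply ignored, which is an added assumption.

```python
from itertools import groupby

DIGITS = "01230120022455012623010202"  # codes for A..Z, as in the post

def soundex_groupby(name, length=4):
    """Sketch of the same steps, using groupby to merge repeated digits."""
    # keep only ASCII letters so the table index stays in range
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return "0" * length
    codes = [DIGITS[ord(c) - ord("A")] for c in letters]
    # groupby yields one key per run of equal consecutive digits
    merged = [d for d, _run in groupby(codes)]
    # the first letter replaces the first digit; zero codes are dropped
    tail = "".join(d for d in merged[1:] if d != "0")
    # pad with zeros and cut to the requested length
    return (letters[0] + tail + "0" * length)[:length]

for sample in ("Robert", "Rupert", "Tymczak", "Pfister", "Lee"):
    print(sample, soundex_groupby(sample))
```

Because H and W carry code 0 in this table, they separate equal codes just as vowels do, so 'Ashcraft' comes out as A226 here; variants that ignore H and W between same-coded letters give A261 instead.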
--
Greg Jorgensen
Deschooling Society
Portland, Oregon, USA
gregj at pobox.com