soundex (revisited)

Greg Jorgensen gregj at pobox.com
Mon Dec 25 04:55:04 EST 2000


"Daniel Klein" <DanielK at aracnet.com> wrote in message
news:Var16.263$LU6.109277 at typhoon.aracnet.com...
> After seeing the post from several days ago on soundex, I gave it a whirl to
> see if I could come up with something different (and possibly better),
> following the rules laid down by Knuth:
>
> def get_soundex(name, digits = 3):
>     soundexcodes = "01230120022455012623010202"
>     #               ABCDEFGHIJKLMNOPQRSTUVWXYZ
>     instring = name.upper()
>     soundex = instring[0]
>     last = soundex
>     instring = instring[1:]
>     for char in instring:
>         if 65 <= ord(char) <= 90:
>             sx = soundexcodes[ord(char) - 65]
>             if int(sx) and char != last:
>                 soundex += sx
>                 last = char
>     if len(soundex) < (digits + 1): soundex = (soundex + ("0" * digits))
>     return soundex[:digits + 1]

I see a few problems, mainly in the handling of consecutive consonants. You
are checking for consecutive identical characters, but the Soundex algorithm
specifies that consecutive letters with the same code be treated as a single
code. Both 'mm' and 'mn' count as repeated codes because 'm' and 'n' are both
coded as 5.
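
A minimal sketch of the difference, assuming the input contains only the
letters A-Z. The helper names are made up for illustration, and the first
function is a simplified stand-in for the character comparison, not a copy of
the quoted code:

```python
CODES = '01230120022455012623010202'  # Soundex digits for A..Z, same table as above


def code(ch):
    """Look up the Soundex digit for one uppercase letter A-Z."""
    return CODES[ord(ch) - ord('A')]


def skip_repeated_letters(word):
    """Skip a letter only when it repeats the previous letter."""
    out = []
    last = ''
    for ch in word.upper():
        if ch != last:
            out.append(code(ch))
        last = ch
    return ''.join(out)


def skip_repeated_codes(word):
    """Skip a letter when its code repeats the previous code."""
    out = []
    for ch in word.upper():
        d = code(ch)
        if not out or d != out[-1]:
            out.append(d)
    return ''.join(out)


print(skip_repeated_letters('mn'))  # two digits: the letters differ
print(skip_repeated_codes('mn'))    # one digit: both letters code as 5
```

Comparing letters collapses 'mm' but not 'mn'; comparing codes collapses both.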

You can (and probably should) use the isalpha() string method to check for
alpha characters, rather than the 'magic numbers' 65 through 90. Likewise,
ord(char) - ord('A') is clearer than ord(char) - 65.
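
As a small sketch of both suggestions, the hypothetical helper below
(soundex_digit is not from either post) looks up the code for one character.
It assumes plain ASCII input: in current Python versions isalpha() also
accepts letters outside A-Z, which would index past the end of the 26-entry
table.

```python
SOUNDEX_DIGITS = '01230120022455012623010202'  # codes for A..Z


def soundex_digit(ch):
    """Return the Soundex digit for one ASCII character, or None for non-letters."""
    ch = ch.upper()
    # isalpha() replaces the explicit 65 <= ord(ch) <= 90 range test
    if not ch.isalpha():
        return None
    # ord('A') states the intent of the offset; no bare 65 needed
    return SOUNDEX_DIGITS[ord(ch) - ord('A')]


print(soundex_digit('m'), soundex_digit('N'), soundex_digit('3'))
```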

Here's a version I wrote. I'm open to any criticisms, suggestions, etc. I
compared my version to the module announced here a while back (I think mine
is a lot more readable; it is certainly shorter). I also compared it to a
Perl version I found and I think my implementation is more robust and
smaller.

def soundex(name, len=4):
    """ soundex function conforming to Knuth's algorithm
        implementation 2000-12-24 by Gregory Jorgensen
        public domain
    """

    # digits holds the soundex values for the alphabet
    digits = '01230120022455012623010202'
    sndx = ''
    fc = ''

    # translate alpha chars in name to soundex digits
    for c in name.upper():
        if c.isalpha():
            if not fc: fc = c   # remember first letter
            d = digits[ord(c)-ord('A')]
            # duplicate consecutive soundex digits are skipped
            if not sndx or (d != sndx[-1]):
                sndx += d

    # replace first digit with first alpha character
    sndx = fc + sndx[1:]

    # remove all 0s from the soundex code
    sndx = sndx.replace('0','')

    # return soundex code padded to len characters
    return (sndx + (len * '0'))[:len]
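
To make the intermediate strings visible, here is a sketch that follows the
same steps as the function above but returns the result of each step. The
name soundex_steps and the parameter name length are made up for
illustration, and ASCII input is assumed.

```python
def soundex_steps(name, length=4):
    """Return (coded, with_letter, no_zeros, padded) for the steps above."""
    digits = '01230120022455012623010202'  # codes for A..Z
    first = ''
    coded = ''

    # translate letters to digits, skipping repeated consecutive digits
    for c in name.upper():
        if c.isalpha():
            if not first:
                first = c  # remember first letter
            d = digits[ord(c) - ord('A')]
            if not coded or d != coded[-1]:
                coded += d

    # replace the first digit with the first letter
    with_letter = first + coded[1:]

    # remove all 0s
    no_zeros = with_letter.replace('0', '')

    # pad with 0s and cut to the requested length
    padded = (no_zeros + length * '0')[:length]
    return coded, with_letter, no_zeros, padded


print(soundex_steps('Jorgensen'))
# ('206205205', 'J06205205', 'J62525', 'J625')
print(soundex_steps('Lee'))
# ('40', 'L0', 'L', 'L000')
```

The 0 digits are kept until after duplicates are skipped, so a vowel between
two letters with the same code (the 'n', 'e', 'n' at the end of 'Jorgensen')
keeps them from being merged.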


--
Greg Jorgensen
Deschooling Society
Portland, Oregon, USA
gregj at pobox.com




