ASPN ActiveState Programmer Network  
ActiveState, a division of Sophos
/ Home / Perl / PHP / Python / Tcl / XSLT /
/ Safari / My ASPN /
Cookbooks | Documentation | Mailing Lists | Modules | News Feeds | Products | User Groups
Submit Recipe
My Recipes

All Recipes
All Cookbooks


View by Category

Title: soundex algorithm
Submitter: Greg Jorgensen (other recipes)
Last Updated: 2001/03/06
Version no: 1.0
Category: Text, Algorithms

 

4 stars 3 vote(s)


Approved

Description:

Function to generate soundex code for any string (usually a name). Conforms to Knuth's algorithm and the common Perl implementation.

Source: Text Source

def soundex(name, len=4):
    """ soundex module conforming to Knuth's algorithm
        implementation 2000-12-24 by Gregory Jorgensen
        public domain
    """

    # digits holds the soundex values for the alphabet
    digits = '01230120022455012623010202'
    sndx = ''
    fc = ''

    # translate alpha chars in name to soundex digits
    for c in name.upper():
        if c.isalpha():
            if not fc: fc = c   # remember first letter
            d = digits[ord(c)-ord('A')]
            # duplicate consecutive soundex digits are skipped
            if not sndx or (d != sndx[-1]):
                sndx += d

    # replace first digit with first alpha character
    sndx = fc + sndx[1:]

    # remove all 0s from the soundex code
    sndx = sndx.replace('0','')

    # return soundex code padded to len characters
    return (sndx + (len * '0'))[:len]

Discussion:



Add comment

Number of comments: 1

Warning: This is designed for English names, Scott David Daniels, 2001/06/23
Warning: This algorithm (by Odell and Russell, as reported in Knuth) is designed for English language surnames. If you have a significant number of non-English surnames, you might do well to alter the values in digits to improve your matches. For example, to accomodate a large number of Spanish surname data, you should count 'J' and 'L' ('L' because of the way 'll' is used) as vowels, setting their position in digit to '0'.
The basic assumptions of Soundex are that the consonants are more important than the vowels, and that the consonants are grouped into "confusable" groups. Coming up with a set of confusables for a language is not horribly tough, but remember: each group should contain all letters that are confusable with any of those in the group. a slightly better code for both English and Spanish names has digits = '01230120002055012623010202'.
Add comment



Highest rated recipes:

1. A simple XML-RPC server

2. Web service accessible ...

3. IPy Notify

4. Treat the Win32 Registry ...

5. a friendly mkdir()

6. Wrapping template engine ...

7. Assignment in expression

8. Changing return value ...

9. Implementation of sets ...

10. bag collection class




Privacy Policy | Email Opt-out | Feedback | Syndication
© 2006 ActiveState Software Inc. All rights reserved.