Title: soundex algorithm
Submitter: Greg Jorgensen
Last Updated: 2001/03/06
Version no: 1.0
Category: Text, Algorithms
Description:
Function to generate soundex code for any string (usually a name). Conforms to Knuth's algorithm and the common Perl implementation.
Source:
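def soundex(name, len=4):
    """ soundex function conforming to Knuth's algorithm
    implementation 2000-12-24 by Gregory Jorgensen
    public domain
    """
    # digits holds the soundex code for each letter A..Z; '0' marks letters that are dropped
    digits = '01230120022455012623010202'
    sndx = ''
    fc = ''
    # translate letters to digits, skipping anything outside A..Z
    for c in name.upper():
        if 'A' <= c <= 'Z':
            if not fc: fc = c   # remember the first letter
            d = digits[ord(c)-ord('A')]
            # collapse runs of the same digit
            if not sndx or (d != sndx[-1]):
                sndx += d
    # replace the first digit with the first letter
    sndx = fc + sndx[1:]
    # remove all '0's
    sndx = sndx.replace('0','')
    # pad with '0's or truncate to len characters
    return (sndx + (len * '0'))[:len]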
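To see how a code is built up, the sketch below replays the same steps for one name and returns the intermediate strings. `soundex_steps` and `DIGITS` are hypothetical names introduced here for illustration, not part of the recipe, and the helper assumes only the letters A..Z matter.

```python
DIGITS = '01230120022455012623010202'  # same table as the recipe, indexed A..Z

def soundex_steps(name, length=4):
    """Hypothetical helper: return (collapsed digits, first letter restored, final code)."""
    letters = [c for c in name.upper() if 'A' <= c <= 'Z']
    if not letters:
        return '', '', '0' * length
    collapsed = ''
    for c in letters:
        d = DIGITS[ord(c) - ord('A')]
        # keep a digit only if it differs from the one just before it
        if not collapsed or d != collapsed[-1]:
            collapsed += d
    # swap the first digit for the first letter, then drop zeros and pad
    with_first = letters[0] + collapsed[1:]
    code = (with_first.replace('0', '') + '0' * length)[:length]
    return collapsed, with_first, code

print(soundex_steps('Robert'))   # ('601063', 'R01063', 'R163')
print(soundex_steps('Tymczak'))  # ('305202', 'T05202', 'T522')
```

In 'Tymczak' the adjacent C and Z both map to '2', so only one '2' is kept at that point; the later K still contributes its own '2' because the vowel A separates it from the C/Z pair.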
Discussion:
Number of comments: 1
Warning: This is designed for English names, Scott David Daniels, 2001/06/23
This algorithm (by Odell and Russell, as reported in Knuth) is designed for English language surnames. If you have a significant number of non-English surnames, you might do well to alter the values in digits to improve your matches.
For example, to accommodate a large number of Spanish surnames, you should count 'J' and 'L' ('L' because of the way 'll' is used) as vowels, setting their positions in digits to '0'.
The basic assumptions of Soundex are that the consonants are more important than the vowels, and that the consonants are grouped into "confusable" groups. Coming up with a set of confusables for a language is not horribly tough, but remember: each group should contain all letters that are confusable with any of those in the group. A slightly better code for both English and Spanish names has digits = '01230120002055012623010202'.
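One way to try the alternative table is to pass the digit string in as a parameter. The following is a minimal sketch under that assumption; `soundex_with_table`, `ENGLISH_DIGITS` and `SPANISH_DIGITS` are hypothetical names, and 'Castiyo' is an invented phonetic misspelling used only to show the effect.

```python
ENGLISH_DIGITS = '01230120022455012623010202'  # table from the recipe
SPANISH_DIGITS = '01230120002055012623010202'  # table from the comment: J and L coded as '0'

def soundex_with_table(name, digits=ENGLISH_DIGITS, length=4):
    """Hypothetical variant of the recipe's soundex() with a selectable digit table."""
    sndx = ''
    first = ''
    for c in name.upper():
        if 'A' <= c <= 'Z':
            if not first:
                first = c
            d = digits[ord(c) - ord('A')]
            # collapse runs of the same digit
            if not sndx or d != sndx[-1]:
                sndx += d
    # restore the first letter, drop zeros, then pad or truncate
    sndx = (first + sndx[1:]).replace('0', '')
    return (sndx + length * '0')[:length]

for name in ('Castillo', 'Castiyo'):
    print(name, soundex_with_table(name), soundex_with_table(name, SPANISH_DIGITS))
# Castillo C234 C230
# Castiyo C230 C230
```

With the original table the 'll' contributes a '4', so the two spellings get different codes; with 'L' treated as a vowel they match.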